About the Author

Avinash Ramachandran has been working in video compression for over 15 years as a serial startup technologist, innovator, and speaker. His work on video coding has contributed to several patents in motion estimation, motion compensation, and bitrate control algorithms. He is currently developing next-generation algorithms and products with the H.265, VP9, and AV1 codecs at NGCodec Inc. A senior member of the IEEE, he completed his Master's in Digital Signal Processing at the Indian Institute of Technology Madras in India and holds an MBA from the Richard Ivey School of Business in Canada.

DECODE TO ENCODE

Copyright © 2019 Avinash Ramachandran

All rights reserved.

California, USA

This work is subject to copyright. No parts of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the copyright owner.

This book is sold subject to the condition that it shall not, by way of trade or otherwise, be lent, resold, hired out, or otherwise circulated without the publisher's prior consent in any form of binding or cover other than that in which it is published and without a similar condition including this condition being imposed on the subsequent purchaser. Under no circumstances may any part of this book be photocopied for resale.

For information contact

Avinash Ramachandran
info@decodetoencode.com
www.decodetoencode.com

First Edition November 2018

To my wife Veena, who’s always lovingly accepted me, supported me and stood by me.

 

 

About the Author
List of Figures
List of Tables
Acronyms
Preface
Acknowledgements
Organization of the Book
1      Introduction to Digital Video
1.1      Interlaced Video
1.2      Sampling
1.3      Color Spaces
1.4      Bit Depth
1.5      HDR
1.5.1      Color Space
1.5.2      Bit Depth
1.5.3      Transfer Function
1.5.4      Metadata
1.5.5      HDR Landscape
1.6      Summary
1.7      Notes
2      Video Compression
2.1      Landscape
2.2      Why is Video Compression needed?
2.3      How is video compressed?
2.3.1      Spatial Pixel Redundancy
2.3.2      Exploiting Temporal Pixel Correlation
2.3.3      Entropy Coding
2.3.4      Exploiting Psycho-Visual Redundancies
2.3.5      8-bit versus 10-bit encoding
2.4      Summary
2.5      Notes
3      Evolution of Codecs
3.1      Key Breakthroughs in Encoding Research
3.1.1      Information Theory (Entropy Coding)
3.1.2      Prediction
3.1.3      Transform coding
3.2      Evolution of Video Coding Standards
3.2.1      Timelines and Developments
3.2.2      Comparison Table of MPEG2, H.264, and H.265
3.3      Summary
3.4      Notes
4      Video Codec Architecture
4.1      Hybrid Video Coding Architecture
4.1.1      Intra Frame Encoding
4.1.2      Inter Frame Encoding
4.1.3      Group of Pictures (GOP) Structures
4.2      Block-based Prediction
4.3      Slices and Tiles
4.4      Interlaced versus Progressive Scan
4.5      Summary
4.6      Notes
5      Intra Prediction
5.1      The Prediction Process
5.2      Transform Blocks and Intra Prediction
5.3      Comparison across Codecs
5.4      Summary
6      Inter Prediction
6.1      Motion-based Prediction
6.1.1      Motion Compensated Prediction
6.2      Motion Estimation Algorithms
6.3      Sub Pixel Interpolation
6.3.1      Sub-pixel Interpolation in HEVC
6.4      Motion Vectors Prediction
6.5      Summary
6.6      Notes
7      Residual Coding
7.1      What are frequencies?
7.2      How Can an Image be Broken Down into its Frequencies?
7.2.1      Why move to the frequency domain?
7.2.2      Criteria for Transform Selection
7.2.3      Discrete Cosine Transform
7.3      Quantization
7.3.1      The Basic Concepts
7.3.2      Quantization Matrix
7.3.3      Quantization in Video
7.3.4      How can QP values be assigned?
7.4      Reordering
7.5      Run Level Pair Encoding
7.6      Summary
8      Entropy Coding
8.1      Information Theory Concepts
8.1.1      The Concept of Entropy
8.1.2      How are the likelihoods or probabilities handled?
8.2      Context Adaptive Binary Arithmetic Coding
8.2.1      Binarization
8.2.2      Context Modeling
8.2.3      Arithmetic Coding
8.3      Summary
8.4      Notes
9      Filtering
9.1      Why is In-loop Filtering Needed?
9.2      Deblocking Filter
9.2.1      Deblocking Process
9.2.2      Filtering Example
9.3      SAO
9.3.1      Edge Offset Mode
9.3.2      Band Offset Mode
9.3.3      SAO Implementation
9.4      Summary
9.5      Notes
10      Mode Decision and Rate Control
10.1      Constraints
10.2      Distortion Measures
10.2.1      Sum of Absolute Differences
10.2.2      SATD (Sum of Absolute Transform Differences)
10.3      Formulation of the Encoding Problem
10.4      Rate Distortion Optimization
10.5      Rate Control Concepts
10.5.1      Bit Allocation
10.5.2      RDO in Rate Control
10.5.3      Summary of Rate Control Mechanism
10.6      Adaptive Quantization (AQ)
10.7      Summary
10.8      Notes
11      Encoding Modes
11.1      VBR Encoding
11.2      CBR Encoding
11.3      CRF Encoding
11.4      When to Use VBR or CBR?
11.4.1      Live Video Broadcasting
11.4.2      Live Internet Video Streaming
11.4.3      Video on Demand Streaming
11.4.4      Storage
11.5      Summary
12      Performance
12.1      Objective Video Quality Metric
12.1.1      Peak Signal-to-Noise Ratio (PSNR)
12.1.2      Structural Similarity (SSIM)
12.1.3      Video Multimethod Assessment Fusion (VMAF)
12.2      Encoder Implementations
12.2.1      H.264 Encoder
12.2.2      H.265 Encoder
12.2.3      VP9 Encoder
12.3      Video Quality Assessments
12.4      Summary
12.5      Notes
13      Advances in Video
13.1      Per-title Encoder Optimization
13.2      Machine Learning
13.2.1      ML Tools for Video Coding Optimization
13.3      Emerging AV1 Codec
13.4      Virtual Reality and 360° Video
13.4.1      What is 360° Video?
13.5      Summary
13.6      Notes
Resources
INDEX

List of Figures

Figure 1: Scanning odd and even lines in interlaced video.
Figure 2: Horizontal and vertical resolutions of a picture.
Figure 3: Luminance and chrominance spatial sampling in a picture.
Figure 4: The spatial and temporal sampling of a visual scene.
Figure 5: A frame of video downscaled to different resolutions [1].
Figure 6: A 16x16 block of pixels in a frame that illustrates the 256 square pixels [1].
Figure 7: A color image and its three constituent components [1].
Figure 8: 8-bit and 10-bit representation of pixel values.
Figure 9: Color ranges in Rec. 709 (left) versus Rec. 2020 (right).
Figure 10: Illustration of spatial correlation in pixels in one frame of the akiyo sequence [2].
Figure 11: Illustration of spatial correlation in pixels.
Figure 12: Illustration of temporal correlation in successive frames of the akiyo sequence [2].
Figure 13: 4:2:2 and 4:2:0 subsampling of pixels.
Figure 14: HVS sensitivity to the low-frequency sky and high-frequency tree areas.
Figure 15: Lack of detail is more prominent in large and smooth areas like the sky [3].
Figure 16: Manfred Schroeder's predictive coding techniques.
Figure 17: Netravali and Stuller's motion compensated prediction in the transform domain.
Figure 18: The process of encoding and decoding.
Figure 19: Illustration of P frame and B frame encoding.
Figure 20: Sequence of frames in display order as they appear in the input.
Figure 21: Sequence of frames in encode/decode order as they appear in the bitstream.
Figure 22: Illustration of frame size across different frame types.
Figure 23: Illustration of GOP with M=5 and N=15.
Figure 24: P and B reference frames without hierarchical B prediction.
Figure 25: Hierarchical B reference frames.
Figure 26: Block diagram of a block-based encoder.
Figure 27: A 64x64 superblock in VP9 is partitioned recursively into sub-partitions.
Figure 28: Recursive partitioning from 64x64 blocks down to 4x4 blocks.
Figure 29: Partition of a picture into blocks and sub-partition of blocks.
Figure 30: A video frame is split into three slices.
Figure 31: Splitting a frame into 4 independent column tiles.
Figure 32: Illustration of neighbor blocks used for intra prediction.
Figure 33: Intra prediction from neighbor blocks by using different directional modes: (a) horizontal, (b) vertical, (c) diagonal.
Figure 34: Intra prediction angular modes defined in H.265.
Figure 35: Original raw source.
Figure 36: Image formed from intra predicted pixels.
Figure 37: Residual image formed by subtracting original and predicted pixel values.
Figure 38: Deriving the motion vector using motion search.
Figure 39: Frame 56 of input - stockholm 720p YUV sequence.
Figure 40: Motion compensated prediction frame.
Figure 41: Motion vectors from reference frames.
Figure 42: Motion compensated residual frame [1].
Figure 43: Illustration of bidirectional prediction.
Figure 44: Fades in video sequences.
Figure 45: Block-based motion estimation.
Figure 46: Search area with range +/- 128 pixels around a 64x64 block.
Figure 47: Three step search for motion estimation.
Figure 48: Example of integer and sub-pixel prediction.
Figure 49: Pixel positions for luma interpolation in HEVC.
Figure 50: Motion vectors of neighboring blocks are highly correlated [2].
Figure 51: Illustration of high frequency and low frequency areas in an image.
Figure 52: Residual sample values for a 32x32 block.
Figure 53: Energy compaction after transforms.
Figure 54: Residual samples: top-left 8x8 block.
Figure 55: 8x8 DCT coefficients of the residual samples.
Figure 56: Flexible use of different transform sizes.
Figure 57: Process of quantization.
Figure 58: Quantization matrix.
Figure 59: Quantization using a quantization matrix.
Figure 60: A 16x16 block of residual values after prediction.
Figure 61: The 16x16 block after undergoing a 16x16 transform.
Figure 62: The 16x16 block after undergoing quantization.
Figure 63: The 16x16 block after inverse quantization.
Figure 64: The reconstructed 16x16 block after inverse 16x16 transform.
Figure 65: The reconstructed 16x16 block after inverse 16x16 transform in the QP 30 case.
Figure 66: The reconstructed 16x16 block after inverse 16x16 transform in the QP 20 case.
Figure 67: Effects of the quantization process.
Figure 68: 8x8 block of quantized coefficients.
Figure 69: Zig-zag scanning order of coefficients of an 8x8 block.
Figure 70: Default scanning order of coefficients of an 8x8 block in VP9.
Figure 71: Block diagram of a context adaptive binary arithmetic coder.
Figure 72: Context modeling and arithmetic coding.
Figure 73: Process of binary arithmetic coding.
Figure 74: Illustration of coding a sample sequence using arithmetic coding.
Figure 75: Coding the sample sequence using different context probabilities.
Figure 76: Illustration of dynamic probability adaptation in arithmetic coding.
Figure 77: Illustration of decoding an arithmetic coded bitstream.
Figure 78: Order of processing of deblocking for 4x4 blocks of a superblock in VP9.
Figure 79: akiyo clip encoded at 100 kbps with deblocking.
Figure 80: akiyo clip encoded at 100 kbps with deblocking disabled.
Figure 81: Video decoding pipeline with in-loop deblocking and SAO filters.
Figure 82: Four 1D patterns for the edge offset SAO filter in HEVC.
Figure 83: Pixel categorization to identify local valley, peak, concave or convex corners [1].
Figure 84: Illustration of BO in HEVC, where the dotted curve is the original samples and the solid curve is the reconstructed samples.
Figure 85: Rate distortion curve.
Figure 86: Hierarchical picture level bit allocation scheme.
Figure 87: Elements of the rate controller mechanism.
Figure 88: Heat map showing quant offset variation using Adaptive Quantization [2].
Figure 89: Comparison of bit allocations in CBR and VBR modes.
Figure 90: Comparison of images with similar PSNRs but different structural content.
Figure 91: Comparison of SSIM vs bit rates for x264 and x265 encoding.
Figure 92: PSNR-bitrate optimal curve for encoding at three resolutions and various bitrates.
Figure 93: Applying machine learning to build mode decision trees.
Figure 94: Current 360° video delivery workflow with 2D video encoding.

List of Tables

Table 1: Common video resolutions and bandwidths across applications.
Table 2: Summary of enhancements in HDR video over earlier SDR.
Table 3: Comparison and summary of features available in HDR formats.
Table 4: Sample assignment of an unequal number of bits for every value.
Table 5: Timelines of the evolution of video coding standards.
Table 6: Comparison of toolsets in modern video coding standards.
Table 7: Comparison of intra prediction across codecs.
Table 8: Interpolation filter coefficients used in HEVC.
Table 9: Chroma interpolation filter coefficients used in HEVC.
Table 10: Binary codes for fixed length (FL) binarization.
Table 11: Binary codes for TU binarization.
Table 12: Binary codes for 0th and 1st order exp-Golomb binarization code.
Table 13: Binary codes for UEG0 binarization.
Table 14: Comparison of bit allocations in CBR and VBR modes.
Table 15: SSIM for x264 and x265 encoding.
Table 16: Comparison of AV1 tools and enhancements against HEVC and VP9.

Acronyms

This list includes the acronyms used in the book, listed alphabetically.

AOM      Alliance for Open Media
AR      Augmented Reality
AV1      AOMedia Video 1
AVC      Advanced Video Coding
BO      Band Offset filter
CABAC      Context Adaptive Binary Arithmetic Coding
CAVLC      Context Adaptive Variable Length Coding
CB      Coding Block
CBR      Constant Bitrate
CPB      Coded Picture Buffer
CRF      Constant Rate Factor
CTB      Coding Tree Block
CTU      Coding Tree Unit
DCT      Discrete Cosine Transform
DPB      Decoded Picture Buffer
DST      Discrete Sine Transform
EG      Exp-Golomb code
EO      Edge Offset filter
fps      frames per second
GOP      Group of Pictures
HD      High Definition
HDR      High Dynamic Range
HEVC      High Efficiency Video Coding
HRD      Hypothetical Reference Decoder
HVS      Human Visual System
IDCT      Inverse Discrete Cosine Transform
IDR      Instantaneous Decoder Refresh
ISO      International Organization for Standardization
JCT      Joint Collaborative Team (ISO and ITU)
JVT      Joint Video Team
KLT      Karhunen-Loeve Transform
MAD      Mean Absolute Difference
MC      Motion Compensation
ME      Motion Estimation
ML      Machine Learning
MPEG      Moving Picture Experts Group
MR      Mixed Reality
MV      Motion Vector
MVD      Motion Vector Difference
OTT      Over the Top
PSNR      Peak Signal to Noise Ratio
QP      Quantization Parameter
RC      Rate Control
RDO      Rate-Distortion Optimization
RGB      Red Green Blue (Color format)
SAD      Sum of Absolute Differences
SAO      Sample Adaptive Offset filter
SATD      Sum of Absolute Transformed Differences
SD      Standard Definition
SDR      Standard Dynamic Range
SEI      Supplemental Enhancement Information
SSIM      Structural SIMilarity
TR      Truncated Rice
UHD      Ultra-High Definition
VBR      Variable Bitrate
VCEG      Video Coding Experts Group
VMAF      Video Multimethod Assessment Fusion
VOD      Video on Demand
VQ      Video Quality
VR      Virtual Reality
YCbCr      Color Format with Luma (Y) and two Chroma (Cb and Cr)
YUV      Color Format with Luma (Y) and two Chroma (U and V)

Preface

Video coding is complex. YouTube and Netflix use it to deliver great video even in extreme network transmission conditions, but have you ever wondered how they optimize video for low bandwidths? Do technical terms like rate distortion optimization, predictive coding, or adaptive quantization overwhelm you? Have you tried to understand video compression but felt confused and frustrated about where to start?

This is a comprehensive book that can break through any barriers you may have in understanding such technological matters. The chapters and sections consolidate fundamental video coding concepts in an easy-to-assimilate structure. Decode to Encode is the only book that has been designed to answer the hows and whys of the elements of the H.264, H.265, and VP9 video standards. It explains the common coding tools in these three successful standards in clear language, providing examples and illustrations as much as possible. The book neither pertains to any specific standard nor attempts to show that one standard is better than any other. It provides video engineers and students with the understanding of compression fundamentals underlying all major standards that they need to help solve problems, conduct research, and serve their customers better.

I have been a lifetime student of video coding and an active contributor to the development of encoding algorithms and software based on the MPEG2, H.264, H.265, and VP9 standards for encoding and decoding. When researching this book, I drew on years of personal experience as a video codec engineer and product manager. I've also talked with numerous experts in the industry and drawn on information in books and online material on video compression topics. These are compiled in a list of resources at the end of this book.

Knowledge is power, and time is money. Video professionals, students, and others can use this book to quickly build a solid foundation and become experts in next-generation video technologies. Managers and leaders in the industry can use it to build expert teams and significantly boost productivity and creativity. Video technology has advanced rapidly, and compression has improved by over 500% across several generations of standards. Still, the key concepts that have formed the backbone and framework of video and image compression remain the same after four decades of progress. The book you hold in your hands is the one I wish I had when I was a beginner in video coding back in the early 2000s.

Caleb Farrand, a software engineer from San Francisco, has said, "With this book, much of the vocabulary I'd hear around the office started to make sense and I got a better understanding of what encoders do and how they are designed."

Why be the person who gets left behind because you are just drowning in everyday work and don't have the time? Instead, transform your career roadmap today by building your foundations to better understand the changing video technology landscape and be a part of it. Become the person others in the industry look up to for expertise.

I promise that once you understand the concepts dealt with in this book, you will feel significantly more confident, and like an expert, as you walk into your next meeting with peers or customers. And I promise you will be inspired more than ever to explore advanced topics in video compression and take up initiatives that just seem too daunting and far-fetched right now. If you desperately need momentum to leapfrog to your career goals, this book can help.

Everything you need to understand video coding concepts is available to you right here. The compression topics have been carefully organized along the common thread that ties together all the major coding standards like H.264, H.265, and VP9. This makes the information easy to absorb, even in the middle of your busy schedule, travels, and agile work environment. As you keep reading, you will find that each chapter will give you new insights into the next steps in the compression pipeline and equip you with the tools you need to understand how newer video coding technologies are built. This will, in turn, help you to understand how video can be optimized to meet the requirements of emerging experiential technologies like 360° video, virtual reality, augmented reality, and mixed reality. The video technology space is more exciting than ever, and if you are working in this space, the possibilities to grow are endless!

 

 

Avinash Ramachandran

November 2018


Acknowledgements


I would like to acknowledge everyone whose help I have benefitted from in many ways in bringing shape to this book. The examples presented in the book use the raw video akiyo (CIF), in_to_tree (courtesy SVT Sweden), Stockholm and DOTA2 sequences hosted on the Xiph.org Video Test Media [derf's collection] website. The work and efforts of experts in standards organizations, companies and associations like JCT-VC, IEEE, Fraunhofer HHI, Google and universities driving state-of-the-art research in Video Compression technologies have contributed significantly to this book. I would like to thank authors and contributors of several excellent books on the subject, online technical blog articles and research papers whose material I consulted while writing this book. An extensive list of these sources is also included in the resources section of the book. Special thanks for review, discussions, and support are also due to Rakesh Patel, Oliver Gunasekara, Yueshi Shen, Tarek Amara, Ismaeil Ismaeil, Akrum Elkhazin, Harry Ramachandran and Edward Hong. Thanks to Susan Duncan for all the editorial assistance. This book has been possible only because of the unstinting support of my wonderful family: my mother Vasanthi, my wife Veena and our children Agastya and Gayatri.

 

 

 

 


Organization of the Book


This book is organized into three parts. Part I introduces the reader to digital video in general and lays the groundwork for a foray into video compression. It provides details of basic concepts relevant to digital video as well as insights into how video is compressed and the characteristics of video that we take advantage of in order to compress it. It also covers how exactly these characteristics are exploited progressively to achieve the significant level of compression that we have today. Part I concludes by providing a brief history of the evolution of video codecs and summarizes the important video compression standards and their constituent coding tools.


Building on this foundation, Part II focuses on all the key compression technologies. It starts by covering in detail the block-based architecture of a video encoder and decoder that is employed in all video coding standards, including H.264, H.265, VP9, and AV1. Each chapter in Part II explains one core technique, or block, in the video encoding and decoding pipeline. These include Intra prediction, inter prediction, motion compensation, transform and quantization, loop filtering and rate control. I have generously illustrated these techniques with numerical and visual examples to help provide an intuitive understanding. Well-known, industry-recognized clips, including in_to_tree, stockholm, akiyo, and DOTA2 have been used throughout the book for these illustrations. I have also attempted to provide explanations of not just the overall signal flow but also why things are done the way they are.


Equipped with all the essential technical nuts and bolts, you will then be ready to explore, in Part III, how all these nitty-gritties together make up an encoder, how to configure one and how to use it in different application scenarios. Part III presents different application scenarios and shows how encoders are tuned to achieve compression using the tools that were detailed in Part II. Specifically, the section explains in detail the various bit rate modes, quality metrics and availability, and performance testing of different codecs. Part III concludes with a chapter on upcoming developments in the video technology space, including content-specific, per-title optimized encoding, the application of machine learning tools in video compression, video coding tools in the next generation AV1 coding standard, and also compression for new experiential video platforms like 360 Video and VR.


I hope that the book is able to illustratively convey the entire video compression landscape and to inspire the reader toward further exploration, collaborations, and pioneering research in the exciting and rapidly-advancing field of video coding.


Part I

1 Introduction to Digital Video


In this chapter, we explore how visual scenes are represented digitally. I will explain various specialized terms used in digital video. This is useful before we explore the realm of digital video compression. If you have a working knowledge of uncompressed digital video, you may briefly skim through this section or skip it entirely and proceed to the next chapter. Once you complete this chapter, you will better understand how terms like sampling, color spaces and bit depths apply to digital video.


Digital video is the digital representation of a continuous visual scene. In its simplest sense, a visual scene in motion can be represented as a series of still pictures. When these still pictures are consecutively displayed in rapid progression, the human eye interprets the pictures as a moving scene rather than perceiving the individual images. This is why, during the early days of filmmaking, it was called moving pictures. This, over time, became condensed to movies.


MOVIES = Moving + Pictures


To capture a visual scene and represent it digitally, cameras therefore sample the scene temporally; that is, they derive still images from the scene at intervals over time. This method of capturing and displaying a complete picture at regular intervals of time results in what is referred to in the industry as progressive video. In addition, every temporal image is spatially sampled to obtain the individual digital pixels.

1.1 Interlaced Video


In the early days of television, interlaced video technology was used to represent video images that were projected on CRTs. In interlaced video, every line comprising a row of pixels is scanned to make up a picture. Each of these lines is called a field and alternate field lines are scanned and displayed in succession. Every odd field is scanned first followed by every even field. Each of these fields is displayed in half the time used to display a complete frame. Thus, a single frame of video is scanned and displayed as two half frames that are interwoven with one another. This is shown in Figure 1 below. As these fields are displayed at half the time it takes to display a frame, it happens very quickly, and we get the illusion of a full frame.


Today, we have progressed from analog video to high definition digital video. However, legacy interlaced video formats still exist in linear video broadcasting. When interlaced content is used, the video is denoted with an ‘i’ after the video resolution, e.g., 1080i or 480i. We have briefly discussed the interlaced video mechanism in this section to provide background. The remainder of the book will focus exclusively on progressive video technology.


Figure 1: Scanning odd and even lines in interlaced video
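The interleaving of odd and even field lines shown in Figure 1 can be sketched in a few lines. This is an illustrative sketch, not code from any codec: a frame is modeled simply as a list of pixel rows, and each field is just the alternating rows.

```python
def split_fields(frame):
    """Split a frame (a list of pixel rows) into its two interlaced
    fields: the odd lines (rows 0, 2, 4, ...) and the even lines
    (rows 1, 3, 5, ...), counting lines from 1 as broadcasters do."""
    top_field = frame[0::2]      # first, third, fifth ... lines
    bottom_field = frame[1::2]   # second, fourth, sixth ... lines
    return top_field, bottom_field

def weave_fields(top_field, bottom_field):
    """Re-interleave two fields back into a full frame."""
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)
        frame.append(bottom_row)
    return frame

# A toy 4-line "frame" with two pixels per line, for brevity.
frame = [[10, 11], [20, 21], [30, 31], [40, 41]]
top, bottom = split_fields(frame)
print(top)                                  # rows 0 and 2
print(bottom)                               # rows 1 and 3
print(weave_fields(top, bottom) == frame)   # True: weaving restores the frame
```

Each field holds half the lines, which is why it can be scanned and displayed in half the time of a full frame.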

1.2 Sampling


So, what is sampling? Sampling is the conversion of a continuous signal into a discrete signal. This can be done in space and in time. In the case of video, the visual scene sampled at a point in time produces a frame, or picture, of the digital video. Normally, scenes are sampled at 30 or 25 frames per second; however, a 24 frames-per-second sampling rate is used for movie production. When the frames are rapidly played back at the rate at which they were sampled (frames per second, or fps), they produce the motion picture effect. Each frame, in turn, is composed of three components, one of which is usually sufficient to represent monochrome pictures while the remaining two are included only for color images. These components are obtained by sampling the image spatially and together are called pixels or pels. Thus, every pixel in a video has one component (for monochrome) or three components (for color). The number of spatial pixels that make up a video frame determines how accurately the source has been captured and represented. As shown in Figure 2, this is a 2-dimensional array of horizontal and vertical sample points in the visual scene, and the total number is the parameter called video resolution. Mathematically, this can be expressed as:


Resolution = H (in pixels) x V (in pixels)


Figure 2: Horizontal and vertical resolutions of a picture.


Thus, if a video has a resolution of 1920x1080, this means that it has 1920 horizontal pixel samples and 1080 vertical pixel rows. It should be noted that the resolution of the video usually refers to the first component of the video, namely, luminance, while the two color, or chrominance, components may be sampled at the same or lower sampling ratios. The resolution, combined with the frame capture rate expressed in frames per second, determines the captured digital image's degree of fidelity to the original visual scene. In turn, this also determines how much processing and bandwidth is needed to efficiently encode the video for transmission and storage.
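The resolution formula can be applied directly to the common frame sizes discussed in this chapter. A small illustrative calculation, counting luma samples only:

```python
def resolution_in_pixels(h, v):
    """Resolution = H (in pixels) x V (in pixels)."""
    return h * v

print(resolution_in_pixels(1920, 1080))  # 2073600 pixels (~2.1 megapixels, "1080p")
print(resolution_in_pixels(1280, 720))   # 921600 pixels ("720p")
print(resolution_in_pixels(3840, 2160))  # 8294400 pixels: exactly 4x 1080p
```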


Figure 3, below, shows how both the brightness component (called luminance) and color components (chrominance) are sampled in a typical picture. In this example, the chrominance is subsampled by a factor of 2, both horizontally and vertically, compared to the luminance. Figure 4 illustrates how the temporal sampling into pictures and the spatial sampling into pixels comprise the digital representation of an entire video sequence.


Figure 3: Luminance and chrominance spatial sampling in a picture.


Figure 4: The spatial and temporal sampling of a visual scene.
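The 2x horizontal and vertical chroma subsampling shown in Figure 3 can be sketched as follows. This is an illustrative sketch, not from the book: a simple keep-every-other-sample decimation is used, whereas real pipelines typically low-pass filter the chroma before decimating to avoid aliasing.

```python
def subsample_420(chroma_plane):
    """Subsample a chroma plane by 2x horizontally and vertically by
    keeping every other sample in each direction (nearest-neighbor)."""
    return [row[0::2] for row in chroma_plane[0::2]]

# A 4x4 chroma plane becomes 2x2: one chroma sample per 2x2 luma block.
cb = [[ 1,  2,  3,  4],
      [ 5,  6,  7,  8],
      [ 9, 10, 11, 12],
      [13, 14, 15, 16]]
print(subsample_420(cb))  # [[1, 3], [9, 11]]
```

With both chroma planes reduced this way, the chroma carries only a quarter of the samples of the luma, which is the basis of the popular 4:2:0 format.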


When a video has to be compressed (by a process called encoding), the resolution can be changed, depending on various factors like bandwidth availability for storage or transmission. To fit within these constraints, either the bitrate of the encoded video, the video resolution, or a combination of these is adjusted to ensure the final encoded video can be delivered within the constraints of the overall system. Usually, the captured video source is downscaled to the required resolution and then encoded. This is illustrated in Figure 5, which shows a frame from the in_to_tree video sequence [1]. The clip has been downscaled to different resolutions. In internet video streaming, the video is often encoded at multiple resolutions, each of which can serve requests from users who have different bitrate allocations. The following table outlines commonly used video resolutions and bandwidths for some applications.

Table 1: Common video resolutions and bandwidths across applications.

Resolution                                   | Typical Bitrates    | Applications
320x240 at 30fps                             | 200 kbps – 500 kbps | Mobile video
720x480 at 30fps, 720x576 at 25fps           | 500 kbps – 2 Mbps   | Storage (DVD) & broadcast TV transmission
1280x720 at 30 and 25fps                     | 1 Mbps – 3 Mbps     | Video calling, internet video streaming
1920x1080 at 30 and 60fps, 1280x720 at 60fps | 4 Mbps – 8 Mbps    | Internet video streaming, storage and broadcast transmission


Figure 5: A frame of video downscaled to different resolutions [1].


Figure 6, below, shows a 16x16 block in the frame that, when zoomed in, clearly shows the varying color shades in the 16 x 16 square matrix of pixels. Every small square block in this zoomed image corresponds to a pixel that is composed of three components with unique values.


Figure 6: A 16x16 block of pixels in a frame that illustrates the 256 square pixels [1].

1.3 Color Spaces


Colors that make up visual scenes in the real world need to be converted to a digital format in order to represent the visual scene as a series of pictures. A color model is a mathematical model that converts any color to numerical values, so it can be processed digitally. Using the color model, a color can be represented as combinations of some base color components. A color space is a specific implementation of a color model that maps colors from the real world to the color model’s discrete values. Adding this mapping function provides a definite area or range of colors that is supported by the color space and this is called a color gamut. Two popular color spaces are used to represent video, namely, RGB and YUV. These are interchangeable using mathematical functions. That is, one representation can be derived from the other.
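This interchangeability can be illustrated with a short sketch. The coefficients below are one common full-range BT.601 form of the RGB-to-YCbCr mapping for 8-bit samples; the exact constants and offsets vary by standard and signal range, so treat them as an example rather than the only valid choice.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr (BT.601 full-range style).
    Y carries intensity; Cb and Cr carry color, centered at 128."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return round(y), round(cb), round(cr)

# Pure white has full luma and neutral (mid-range) chroma.
print(rgb_to_ycbcr(255, 255, 255))  # (255, 128, 128)
# Pure black: zero luma, neutral chroma.
print(rgb_to_ycbcr(0, 0, 0))        # (0, 128, 128)
```

Note how any gray (R = G = B) maps to neutral chroma values, which is what makes the Y plane alone a complete monochrome picture.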


RGB: As the name implies, in this model, every color is represented as a combination of red (R), green (G) and blue (B) components. Each of these components is independent and combinations of these three components produce the various shades of the color space.


YCbCr: In this scheme, which is often also referred to as YUV, the Y component refers to luma, or intensity, and Cb and Cr refer to chroma, or color, components. This scheme of pixel representation is the most popular for video applications due to the following characteristics, which help to reduce the bits needed for the representation:

In this scheme, the luma and chroma components are represented entirely independently of one another. This means that, in the case of monochrome video, only one component (Y) is needed for the complete representation of the video signal.
The human visual system (HVS) is more sensitive to luminance (Y) and less sensitive to chrominance (UV). By emphasizing the luma and selectively discarding chroma information, significant bit savings can be achieved without a dramatic impact on the viewer's video experience. This is accomplished by subsampling the chroma relative to the luma. Subsampling is covered in more detail in Chapter 2.


In this book, we will deal exclusively with the YUV video format for all video explanations unless otherwise stated.


Figure 7: A color image and its three constituent components [1].


As illustrated in Figure 7, the Y component in the image corresponds to the intensity. This can be used to represent the monochrome version of the image or video whereas the U and V components together constitute the color components.

1.4 Bit Depth


The number of bits used to represent a pixel determines how accurately the visual information is captured from the source. It also determines the intensity variation and range of colors that can be expressed with the sample. That is, if only 1 bit were used to store a pixel value, it could have a value of either 0 or 1. As a consequence, only two colors could be expressed using this pixel: black or white. However, if 2 bits were used then every pixel could represent any of 4 (or 2^2) colors with values 0, 1, 2 and 3.


Video pixels are usually represented using 8 bits per sample. This allows for 256 (or 2^8) variations in color and intensity with values in the range of 0 to 255. However, the normal practice is to restrict the active luminance to a range of 16 (black) to 235 (white). Values 1–15 and 236–254 are reserved for footroom and headroom, respectively, during studio production.
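The mapping between full-range values and the studio (limited) range just described can be sketched as follows; the linear scaling shown here is a common convention and is illustrative rather than taken from this book.

```python
def full_to_limited_luma(y_full):
    """Map a full-range 8-bit luma value (0-255) into the studio range
    16 (black) to 235 (white); the 219-code span is 235 - 16."""
    return 16 + round(y_full * 219 / 255)

print(full_to_limited_luma(0))    # 16  (black)
print(full_to_limited_luma(255))  # 235 (white)
print(full_to_limited_luma(128))  # 126 (mid-gray)
```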


In professional video production systems, a video sequence is processed using 10 bits per sample. This allows 1024 gradations, such that much more subtle levels in intensity and color can be captured. 10-bit encoding is also becoming increasingly popular for UHD resolutions and HDR functionality to provide a richer visual experience in consumer video systems. This is also supported in modern video compression standards.
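The gain from 8-bit to 10-bit samples, and the resulting total color count across three components, works out as follows:

```python
def levels(bit_depth):
    """Number of representable levels per component at a given bit depth."""
    return 2 ** bit_depth

def total_colors(bit_depth, components=3):
    """Total colors when each of the components uses the same bit depth."""
    return levels(bit_depth) ** components

print(levels(8))         # 256 levels per component
print(levels(10))        # 1024 levels per component
print(total_colors(8))   # 16777216 (~16 million colors)
print(total_colors(10))  # 1073741824 (over a billion colors)
```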


Figure 8: 8-bit and 10-bit representation of pixel values.

1.5 High Dynamic Range


Video technology continues to evolve from high definition (HD) resolutions to ultra HD (UHD) and beyond. The new technologies offer four or more times the resolution of HD. Given this evolution, it becomes important to better represent this high density of pixels and the associated colors in order to achieve the enhanced viewing experience that they make possible. Various methods have been explored to improve the representation of a digital video scene, including these two:

  1. Improved spatial and temporal digital sampling. This includes techniques to provide faster frame rates and higher resolutions, as mentioned above.
  2. Better representation of the values of the pixels. Several techniques are incorporated in HDR video to improve how various colors and shades are represented in the pixels.


While the former mostly deals with different ways of pixel sampling, the latter focuses on every individual pixel itself. This is an enhancement over the traditional standard dynamic range (SDR) video. High dynamic range video provides a very significant improvement in the video viewing experience by incorporating improvements in all aspects of pixel representations. In this section, we shall explore how this is done.


HDR, as the name indicates, provides improved dynamic range, meaning that it extends the complete range of luminance values, thereby providing richer detail in the tones of dark and light shades. It doesn't stop there: it also improves the representation of colors. The overall result is a far more natural rendering of a video scene.


It’s important to note that HDR technology for video is quite different from the technology used in digital photography that also uses HDR terminology. In photography, different exposure values are used, and the captures are blended to expand the dynamic range of pixels by creating several local contrasts. However, every capture still uses the same 8-bit depth and 256 levels of brightness. HDR in video extends beyond just the dynamic range expansion to encompass the following [2]:

High dynamic range with higher peak brightness and lower black levels, providing richer contrast;
An improved color space with a wide color gamut (WCG); specifically, a new color standard called Rec. 2020 that replaces the earlier Rec. 709 used for SDR;
Improved bit depth, using 10 bits (for distribution) or 12 bits (for production) in place of the traditional 8 bits in SDR;
Improved transfer functions, such as PQ and HLG, used in place of the earlier gamma function;
Improved metadata, since HDR includes the addition of static (for the entire video sequence) and dynamic (per scene or specific picture) metadata that helps enhance rendering.

Table 2: Summary of enhancements in HDR video over earlier SDR.

Features          | SDR         | HDR
Dynamic range     | Standard    | Enhanced dynamic range with high peak brightness, lower black levels and greater contrast
Bit depth         | 8-bit       | 10-bit or 12-bit
Color space       | Rec. 709    | Rec. 2020
Transfer function | Gamma based | Different new standards: PQ, HLG, etc.
Metadata          | Not present | Static or dynamic


In the remainder of this chapter, we shall explore each of these enhancements briefly and consider how they apply to video encoding technology.

1.5.1 Color Space


In the early 1990s, the HDTV standard was established with the color gamut defined by a standard called Rec. ITU-R BT. 709 (popularly known as Rec. 709). This was enhanced for UHD in 2012 under Rec. ITU-R BT. 2020 (Rec. 2020) to provide a far larger color space that supports a greater variety of shades. Figure 9, below, compares the color spaces available under Rec. 709 versus Rec. 2020. The black triangle drawn within each represents the coverage of color shades under that standard. Clearly, Rec. 2020 supports a much more expansive color gamut with many more shades. This is what is implemented in HDR.


Figure 9: Color ranges in Rec. 709 (left) versus Rec. 2020 (right).


Source: https://commons.wikimedia.org/wiki/File:CIExy1931.svg [3]

1.5.2 Bit Depth


Bit depth, as was explained in the previous section, is the number of bits used to represent every pixel. This, in turn, determines the total number of colors that can be represented. Traditional SDR video uses 8-bit depth, meaning that every pixel can have 256 different values each for red, green and blue (or correspondingly, for Y, U, and V). This results in a total of 256 x 256 x 256, or about 16 million colors per pixel. HDR supports 10-bit color for video distribution, meaning that every pixel can now represent up to 1024 values per color component, or over a billion colors. This massive increase packs in a far more extensive range of shades. The result is smoother color transitions within the same color groups with fewer artifacts. Thus, increasing the bit depth by even 2 bits is extremely useful in improving the visual experience. It does bring with it, however, increased memory requirements and computational expense during any internal process, such as encoding.

1.5.3 Transfer Functions


Electronic image, video capture, and display devices need to convert electronic signals to digitally represented pixel values and vice versa. These devices have nonlinear light intensity-to-signal or signal-to-intensity characteristics. As an example, the voltage from a camera has a nonlinear relationship to the intensity (power) of light in the scene. Traditionally, as specified in the Rec. 709 color standard, the nonlinear mapping is accomplished using a power function called the gamma transfer function. This is expressed in the generic equation

output = input^gamma


where gamma, the exponent of the power function, completely describes this transfer function. The gamma curve, which simply modeled how older electronic display devices (cathode ray tubes, or CRTs) responded to voltage, is no longer completely relevant for modern displays. It has been improved upon in newer functions like PQ (perceptual quantization) and HLG (hybrid log gamma) in HDR.
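A power-law transfer function pair of this form can be sketched as follows. The gamma value of 2.2 is a commonly quoted example, not a requirement of any particular standard, and real standards add refinements such as a linear segment near black.

```python
def gamma_encode(linear, gamma=2.2):
    """Map normalized linear light (0.0-1.0) to a gamma-encoded signal:
    signal = linear ** (1 / gamma), the camera-side direction."""
    return linear ** (1.0 / gamma)

def gamma_decode(signal, gamma=2.2):
    """Inverse, display-side mapping back to linear light:
    linear = signal ** gamma."""
    return signal ** gamma

mid = gamma_encode(0.5)
print(round(mid, 3))                # dark and mid values are lifted above 0.5
print(round(gamma_decode(mid), 3))  # 0.5: decoding inverts encoding
```

Encoding spends more code values on dark tones, where the eye is most sensitive, which is why this nonlinearity survives even though CRTs are gone.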

1.5.4 Metadata


Video signals are compressed before transmission using an encoder at the source. The compressed signals are then decoded at the receiving end. When the decoder receives the encoded stream and produces raw YUV pixels, these pixels will need to be displayed on a computer screen or display. This means the YUV values will need to be converted to the correct color space. This includes a color model and the associated transfer function. These define how the pixel values get converted into light photons on the display panels. Modern codecs have provisions for signaling the color space in the bitstream header using supplemental enhancement information (SEI) messages.


However, what happens when the display device doesn't support the color space with which the source is produced? In this case, it’s important to use the source color space characteristics and convert color spaces at the decoder end to the display’s supported format. This is crucial to ensure that colors aren't displayed incorrectly when displays are incompatible with the source color space.


The HDR enhancements support a mechanism to interpret the characteristics of encoded pictures and use this information as part of the decoding process, incorporating a greater variety of transfer functions. As explained earlier, if content produced using a specific transfer function at the source goes through various transformations in the video processing and transmission workflow and then gets mapped using another transfer function by the display device, the content ends up perceptibly degraded. HDR standards provide enhanced metadata mechanisms to convey the transfer function details from the source to the decoding and display devices.

1.5.5 The HDR Landscape


The HDR landscape does not have one unified format. Instead, it has a few options that have been developed and deployed by different organizations. The following are the five main HDR formats. [2]

Advanced HDR (developed by Technicolor and Philips)
HDR10+ with dynamic metadata (SMPTE ST-2094-40)
HDR10 with dynamic metadata (SMPTE 2094-X)
Dolby Vision
Hybrid Log-Gamma (HLG) with no metadata



Table 3 compares and summarizes the features available in these HDR formats. Different standards use different mechanisms for metadata transmission. Some standards just add metadata during distribution. Earlier standards also used static metadata in which fixed metadata information is generated during video creation and propagated through the video workflow for the entire stream. Other mechanisms use dynamic metadata which can be varied on a per-frame, per-scene basis.

Table 3: Comparison and summary of features available in HDR formats.

Format           | Metadata    | Details
Advanced HDR     | Dynamic     | Technicolor and Philips
HDR 10+          | Dynamic     | Samsung and Panasonic
HDR 10           | Dynamic     | SMPTE 2094-10: Dolby; SMPTE 2094-20: Philips; SMPTE 2094-30: Technicolor; SMPTE 2094-40: Samsung
Dolby Vision     | Dynamic     | Dolby Labs
Hybrid Log-Gamma | Not present | BBC and NHK


For widespread HDR adoption, it’s imperative that devices consistently carry the signal and associated information right from the source to display as part of the end-to-end system. HDR standardization efforts can go a long way toward ensuring HDR adoption at scale. Other important activities will include embedding HDR information like HEVC SEI in the encoding process and HDR support in display standards like HDMI.

1.6 Summary
  • Digital video is the digital representation of a continuous visual scene, obtained by sampling in time to produce frames, which in turn are spatially sampled to obtain pixels.
  • Colors in the real world are converted to pixel values using color spaces.
  • The number of bits used to represent a pixel determines how accurately the visual information is captured from the source; this is called the bit depth, which is often 8-bit or 10-bit for video.
  • HDR technology improves the visual experience by enhancing pixel representation. It incorporates advanced dynamic range, higher bit depth, advanced color space, and transfer functions.
1.7 Notes
  1. in_to_tree. xiph.org. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed September 21, 2018.
  2. High Dynamic Range Video: Implementing Support for HDR Using Software-Based Video Solutions. AWS Elemental Technologies. https://goo.gl/7SMNu3. Published 2017. Accessed September 21, 2018.
  3. I, Sakamura. File:CIExy1931.svg, CIE 1931 color space. Wikimedia Commons. https://commons.wikimedia.org/wiki/File:CIExy1931.svg. Published July 13, 2007. Updated June 9, 2011. Accessed September 20, 2018.


2 Video Compression


As explained earlier, digital video is the representation of a visual scene using a series of still pictures. In the previous chapter, we saw how digital video is represented and explained concepts like sampling and color spaces. In this chapter, we explore why video needs to be compressed and characteristics in the video signal that are exploited to achieve this. In a nutshell, video compression primarily focuses on how to take the contiguous still pictures, identify and remove redundancies in them and minimize the information needed to represent the video sequence.

2.1 The Landscape


Video compression is essential to store and transmit video signals. Typically, video originates from a variety of sources like live sports and news events, movies, live video conferencing and calling, video games, and emerging applications like augmented reality/virtual reality (AR/VR). While some applications like live events and video calling demand real-time video compression and transmission, others, like movie libraries, storage, and on-demand streaming, are non-real-time applications. Each of these applications imposes different constraints on the encoding and decoding process, resulting in differences in compression parameters. A live sports event broadcast requires high-quality, real-time encoding with very low encoding and transmission latency, whereas encoding for a video-on-demand service like Netflix or Hulu is non-real-time and focuses on the highest quality and visual experience. To this effect, every video compression standard provides a variety of toolsets that can be enabled, disabled, and tuned to suit specific requirements. All modern video compression standards, including MPEG2, H.264, H.265, and VP9, define only the toolsets and the decoding process. This is done to ensure interoperability across a variety of devices. Every decoder implementation must decode a compliant bitstream to provide an output identical to the outputs of other implementations operating on the same input. Encoder implementations, on the other hand, are free to choose coding tools available in the standard and tune them as part of their design, as long as they produce an output video that is standard-compliant. Encoder designers can also incorporate different pre-processing techniques, coding tool selections, and tuning algorithms. This may result in dramatic differences in video quality from one encoder to another.

2.2 Why Does Video Need to Be Compressed?


Video representation in its raw form takes a lot of bits. As an illustration, for an uncompressed, raw, UHD-resolution video (3840x2160 pixels at 60fps) with 10 bits/pixel for 3 color components, the bandwidth needed would be: 3840 x 2160 x 60 x 10 x 3 = 14.92 gigabits per second (Gbps). It is not practical to transmit such data over today's internet connections without any processing, as bandwidth is at most a few tens or hundreds of megabits per second (Mbps). As an example, if we had to transmit this 14.92 Gbps UHD video over a 15 Mbps link, it would need to be compressed by a factor of about 1000. Also, a 5-minute video at this resolution would need 559 GB of storage space if stored in its raw format. Imagine attempting to download such a video onto your phone or tablet, devices whose storage typically ranges from 16 GB to 300 GB.
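
The arithmetic above can be checked with a few lines of Python (the script and its variable names are illustrative, not part of any codec toolchain):

```python
# Raw bitrate and required compression ratio for the UHD example above:
# 3840x2160 at 60fps, 3 color components, 10 bits per sample (4:4:4).
width, height, fps = 3840, 2160, 60
bits_per_sample, components = 10, 3

raw_bps = width * height * fps * bits_per_sample * components
print(f"Raw bitrate: {raw_bps / 1e9:.2f} Gbps")                 # ~14.93 Gbps

link_bps = 15e6                                                 # a 15 Mbps link
print(f"Compression factor needed: {raw_bps / link_bps:.0f}x")  # ~995x

storage_5min_bytes = raw_bps * 5 * 60 / 8
print(f"5 minutes raw: {storage_5min_bytes / 1e9:.1f} GB")      # ~559.9 GB
```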


Clearly, storing and transmitting uncompressed video present huge practical challenges; at scale, it is almost impossible. It is estimated that 70% of internet traffic is video [1], and this percentage is rapidly increasing. Continuous technological advances are also permitting increases in video resolution to 4K and 8K, which carry, respectively, four times and sixteen times as many pixels as 1080p. Other improvements include frame rates of 90fps and beyond for emerging immersive applications, and HDR, which requires roughly 25-50% more data than SDR while providing an enhanced user experience. These continue to push the limits of today's video experiences, and compression is the crucial binding glue that enables these technologies to be deployed at scale.


Advanced video compression tools that can optimize video are thus essential to keep up with the rapid improvements in video infrastructure and to reliably deliver next-generation consumer video experiences.

2.3 How Is Video Compressed?


Video signals comprise pixels that contain nearly the same values from one pixel to the next. In other words, there is significant redundancy in representing pixel values in video. Compression techniques work to identify and remove these redundancies while carefully accounting for the human visual system model. Broadly, these redundancies can be classified into two types:

  1. Statistical redundancies
  2. Psycho-visual redundancies


Statistical redundancy refers to the inherent redundancy that exists in the distribution of pixel values within the video sequence. This can be due to the nature of the visual scene itself. For example, if a scene is static, then there is far more redundant information across pixels than is the case with highly textured content. Statistical redundancy can also manifest in the way that these pixels are finally encoded in the bitstream. Statistical redundancy can thus be classified into three types:

  1. Spatial pixel redundancy
  2. Temporal pixel redundancy
  3. Coding redundancy
2.3.1 Spatial Pixel Redundancy


In a video scene, the pixels that are close together within the same picture (or frame) are significantly similar. Figure 10, below, shows a frame from the akiyo video sequence. We see in Figure 10 that the pixels comprising the newsreader’s dress, those of her face, and those of the background are very similar.


Statistically, the pixels that are close together are said to exhibit strong correlation.


Figure 10: Illustration of spatial correlation in pixels in one frame of the akiyo sequence [2].


We can exploit this strong correlation and represent the video with only a differential from a base value, as this requires far fewer bits to encode the pixels. This is the core concept of how differential coding is applied in video compression. All modern video coding standards remove these spatial redundancies and transmit only the minimal bits needed to represent the residual video.


For example, in the group of 4x4 pixels in Figure 11, 8 bits are needed to represent values from 0 through 255. Every pixel will need 8 bits for its representation.


Figure 11: Illustration of spatial correlation in pixels.


Thus, the 4x4 block can be completely represented using 128 bits. We notice, however, that the pixel values in this block vary only by very small amounts. If we were to represent the same signal as a differential from a base value (say 240), then the block can be represented using 8 bits for the base value and 2 bits (values from 0 through 3) for each of the 16 pixels, resulting in a total of only 40 bits. The base value chosen is called a predictor and the effective differential values from this predictor are called residuals. This mechanism of using differential coding to remove spatial pixel redundancies within the same picture is called intra picture coding.
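
The bit-count arithmetic above can be sketched in Python. The 4x4 pixel values below are illustrative stand-ins for Figure 11, chosen so that every value lies within 3 of the base value 240:

```python
# Differential (intra) coding of a 4x4 block: one 8-bit predictor plus
# sixteen 2-bit residuals versus sixteen full 8-bit pixels.
block = [240, 243, 243, 241,
         243, 240, 242, 243,
         241, 243, 240, 243,
         242, 243, 243, 240]

predictor = min(block)                        # base value: 240
residuals = [p - predictor for p in block]    # each residual is in 0..3

direct_bits = 16 * 8                          # 8 bits per pixel
diff_bits = 8 + 16 * 2                        # predictor + 2-bit residuals
print(direct_bits, diff_bits)                 # 128 40
```

The decoder recovers each pixel as predictor + residual, so this differential representation is lossless.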

2.3.2 Exploiting Temporal Pixel Correlation


Figure 12, below, shows three consecutive pictures that are part of the akiyo news reader video sequence. In addition to spatial similarities, the successive pictures in this digital video sequence are strikingly similar and static for the most part, except for some noticeable differences around the eyes and mouth of the news reader. This varies with content: a scene may contain a variety of objects, with different objects moving in different directions across pictures. The core idea, however, is the presence of strong similarities among the pictures, also known as temporal correlation across successive pictures. This is illustrated in Figure 12.


Figure 12: Illustration of temporal correlation in successive frames of the akiyo sequence [2].


This strong correlation of pixels across nearby pictures can be exploited to encode successive video frames using only information that’s new to the frame and removing redundant data that has already been coded in previous pictures. This is done by coding only differential values from pixels in the previous frames. The mechanism of using differential coding to remove temporal pixel redundancies is called inter picture coding.
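
A minimal sketch of this idea, assuming no motion and using tiny one-dimensional "frames" of luma samples (the values are illustrative):

```python
# Inter-picture differencing: a mostly static scene produces a mostly
# zero residual, which is far cheaper to encode than the raw frame.
prev_frame = [52, 55, 61, 66, 70, 61, 64, 73]
curr_frame = [52, 55, 61, 66, 71, 62, 64, 73]   # small change mid-frame

residual = [c - p for c, p in zip(curr_frame, prev_frame)]
print(residual)                                 # [0, 0, 0, 0, 1, 1, 0, 0]

# The decoder reverses the process: previous frame + residual.
decoded = [p + r for p, r in zip(prev_frame, residual)]
assert decoded == curr_frame
```

Real codecs first search the reference picture for the best-matching block (motion estimation) and difference against that prediction rather than the co-located pixels, which makes the residual even smaller when objects move.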

2.3.3 Entropy Coding


In the pixel matrix in Figure 11, above, all residual values need to be transmitted correctly in order to be able to deduce the original pixel values from them. If we were to assign 2 bits to represent each of the four residual values, we would need 2 bits x 16 values = 32 bits to transmit the matrix of residuals. However, upon closer observation, we find that some residual values occur more frequently than others in the matrix. For example, the residual value 3 occurs 8 times. Thus, instead of an equal assignment of bits for each of the 4 values, if we were to assign fewer bits to the most frequently occurring values and more bits to rarely occurring values, this could result in a further reduction in the total number of bits needed. This principle of utilizing statistical redundancies to achieve compression during bit allocation is called entropy encoding. The technique is completely lossless, meaning the original pixel data can be accurately reconstructed from the coded values.


For example, if we use the matrix shown in Table 4 to assign bits to each of the four values, we can represent all the residual values using a total of 4 x 2 + 2 x 3 + 2 x 3 + 8 x 1 = 28 bits.
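
The 28-bit total can be reproduced with a short sketch using the code lengths from Table 4 and the value counts given above:

```python
# Variable-length (entropy) coding versus fixed 2-bit coding for the
# sixteen residuals, where the value 3 occurs 8 times and 0 occurs 4 times.
code_bits = {0: 2, 1: 3, 2: 3, 3: 1}       # bits per value (Table 4 codes)
counts    = {0: 4, 1: 2, 2: 2, 3: 8}       # occurrences in the 4x4 block

fixed_total = 2 * 16                       # equal 2-bit assignment
vlc_total = sum(code_bits[v] * n for v, n in counts.items())
print(fixed_total, vlc_total)              # 32 28
```

Note that the codes in Table 4 (10, 110, 111, 0) are prefix-free: no code is a prefix of another, so the decoder can parse the concatenated bitstream unambiguously.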


It should be noted that the statistical redundancy elimination methods discussed above, in addition to being extremely helpful in compression, are also completely lossless. In the following section, we will discuss a few methods that exploit psycho-visual redundancies and result in lossy coding. These methods provide the bulk of the compression gains achievable with modern video codecs.

Table 4: Sample assignment of an unequal number of bits for every value.

Residual Value    Code    Bit count
0                 10      2
1                 110     3
2                 111     3
3                 0       1

2.3.4 Exploiting Psycho-visual Redundancies


Studies have shown that the human visual system (HVS) is much more sensitive to luminance information (Y) than to chrominance information (U and V). This means that a reduction in the number of bits allocated to the chroma components will have a significantly lower impact on the visual experience than a corresponding reduction in luma bits. By exploiting this perceptual reality, all modern video coding standards use a lower resolution for chroma by subsampling the chroma components, while maintaining full resolution for the luminance component. The following are the most commonly used formats:

  1. 4:2:0: Chroma subsampled by ½ across H and V directions
  2. 4:2:2: Both U & V subsampled by ½ across H direction only
  3. 4:4:4: Full resolution for U and V without any subsampling


The 4:2:2 and 4:2:0 subsampling mechanisms are illustrated in Figure 13 for a sample 8x8 block of pixels, where full luma (Y) resolution is used but the chroma components (Cb and Cr) are both sampled as indicated by the shaded pixels. As 4:2:2 subsamples chroma in the horizontal direction only, every other pixel location along the horizontal rows is used. For 4:2:0, vertical subsampling is used in addition to the above horizontal subsampling. Hence, a pixel location that lies between two consecutive rows of corresponding luma pixel locations is used.
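
The sampling grids can be sketched as follows. Real encoders typically filter neighboring chroma samples rather than simply dropping them, so this is only a positional illustration with dummy values:

```python
# 4:2:2 and 4:2:0 subsampling of an 8x8 chroma plane by sample selection.
chroma = [[r * 8 + c for c in range(8)] for r in range(8)]  # dummy 8x8 plane

# 4:2:2 -> every other column, all rows (half horizontal resolution): 8x4
sub_422 = [row[::2] for row in chroma]

# 4:2:0 -> every other column and every other row: 4x4
sub_420 = [row[::2] for row in chroma[::2]]

print(len(sub_422), len(sub_422[0]))   # 8 4
print(len(sub_420), len(sub_420[0]))   # 4 4
```

In 4:2:0, each chroma plane keeps only a quarter of its samples, so a full YUV 4:2:0 frame carries half as many samples as 4:4:4 overall.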


Video distribution systems, for instance satellite uplinks, cable, and internet video, use 4:2:0 exclusively. In professional facilities like live event capture and news production centers, however, the 4:2:2 format is used to capture, process, and encode video to preserve video fidelity.


Figure 13: 4:2:2 and 4:2:0 subsampling of pixels.


Figure 14: HVS sensitivity to the low-frequency sky, and high-frequency trees, areas.


Additionally, the human eye is sensitive to small changes in luminance over a large area but not very sensitive to rapid luminance changes (high-frequency luminance). In the original image shown in Figure 14, the area around the sky corresponds to large, smooth, low-frequency areas, whereas the texture in the trees corresponds to high-frequency luminance areas. When we begin to remove details in the picture as shown in Figure 15, the eyes notice the changes in the smoother areas like the sky, lake, and pathways more quickly than they do the changes in high-frequency grass and tree areas. The perceptual system is more tolerant of changes in the latter.


Figure 15: Lack of details are more prominent in large and smooth areas like the sky [3].


This is hugely important. What it means is that significant gains can be achieved by prioritizing low-frequency components over high-frequency components. All video standards employ two important techniques to effectively achieve this.

  1. Transform coding. This is a process to convert luma and chroma components from the pixel domain to a different representation called the transform domain.
  2. Quantization. Using this technique, low-frequency components are prioritized and preserved better, while high-frequency components are selectively ignored.
2.3.5 8-bit vs. 10-bit Encoding


It may seem obvious that using fewer bits to represent the video pixels will result in better compression. However, studies have shown that 10-bit encoding is capable of providing better quality than 8-bit encoding, regardless of the bit depth of the source content [4]. While this sounds counterintuitive, the following explains why it is so.


If the content is 10-bit, then the filtering stages before encoding and after decoding could potentially destroy fine details with 8-bit encoding. These details are preserved and leveraged in 10-bit encoding. On the other hand, if the source content is 8-bit but is encoded using 10 bits, the internal encoding processes (like transforms and filters) use at least 10-bit accuracy. This results in smaller rounding errors, especially in the motion-compensated filtering process, thereby increasing the prediction efficiency. With more accurate prediction (which uses fewer bits), lower levels of quantization will be needed to achieve the target bitrate. The ultimate result of all this is superior visual quality.
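
The rounding-error argument can be illustrated with a toy experiment. The chained averaging below is only a stand-in for the repeated interpolation filtering a real codec performs, and the sample values are illustrative:

```python
def chain_average(samples, extra_bits):
    """Repeatedly average with round-to-nearest at the given extra precision."""
    acc = samples[0] << extra_bits
    for s in samples[1:]:
        acc = (acc + (s << extra_bits) + 1) >> 1   # rounded integer average
    return acc / (1 << extra_bits)

samples = [90, 91, 92, 93]
exact = 92.125                               # same chain in exact arithmetic

err_8bit  = abs(chain_average(samples, 0) - exact)   # no extra precision
err_10bit = abs(chain_average(samples, 2) - exact)   # 2 extra bits, as in 10-bit
print(err_8bit, err_10bit)                   # 0.875 0.125
```

With two extra fractional bits, the rounding error of the chain drops from 0.875 to 0.125 of a level, mirroring how 10-bit internals track sub-pixel interpolation more faithfully.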

2.4 Summary
  • Storage and transmission of raw uncompressed video consume enormous bandwidth (e.g., one second of 3840x2160p60 10-bit 4:2:0 video takes 7.46 Gbits). This is not practical, especially as video resolutions increase and consumption grows. Hence, video needs to be compressed.
  • Video pixels are similar, with significant redundancies classifiable into two types: statistical and psycho-visual. Compression techniques work to identify and remove these redundancies.
  • Statistical redundancies are classified into spatial pixel redundancies (across pixels within a frame), temporal pixel redundancies (across frames), and coding redundancies.
  • The human eye is more sensitive to luma than to chroma. Therefore, luma is prioritized over chroma in compression techniques.
  • The human eye is sensitive to small changes in brightness over a large area but not very sensitive to rapid brightness variations. Therefore, low-frequency components are prioritized over high-frequency components during compression.
2.5 Notes
  1. Wong J I. The internet has been quietly rewired, and video is the reason why. Quartz Obsessions. https://qz.com/742474/how-streaming-video-changed-the-shape-of-the-internet/. Published October 31, 2016. Accessed September 21, 2018.
  2. akiyo. xiph.org. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed September 21, 2018.
  3. in_to_tree. xiph.org. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed September 21, 2018.
  4. Why Does 10-bit Save Bandwidth (Even When Content is 8-bit)? ATEME. www.ateme.com. http://x264.nl/x264/10bit_02-ateme-why_does_10bit_save_bandwidth.pdf. Published 2010. Accessed September 21, 2018.
3 Evolution of Codecs
3.1 Major Breakthroughs in Coding Research


Video encoding technologies have consistently progressed over several decades since the 1980s with a suite of successful codecs built on the hybrid block-based architecture. The core technologies that constitute this architecture were developed over decades, starting from research as early as the 1940s. These underpinned the evolution of the architecture into its present form. Today’s codecs still build on many significant research developments, especially from the 1970s and 1980s [1]. The focus of this section is to look at what these core technologies are and how they influenced and contributed to the evolution of the modern video coding standards. This provides us with valuable insights on why coding technologies are the way they are today. It also helps us understand, at a higher level, the fundamental framework of video coding.

3.1.1 Information Theory (Entropy Coding)


It is widely accepted that the landmark breakthrough for communication systems, one that also laid the foundation of information theory, was the publication of Claude Shannon's classic 1948 paper, "A Mathematical Theory of Communication."[2] In this paper, Shannon provided a model for digital communication using statistical processes. He also introduced the concept of entropy to calculate the amount of information in a transmitted message, thereby enabling calculation of the limits of lossless data communication. These concepts served as the building blocks for several entropy coding techniques, starting with David Huffman's 1952 paper, "A Method for the Construction of Minimum Redundancy Codes,"[3] which described a method of efficiently encoding a message with a finite set of symbols using variable-length binary codes. The Huffman coding method, as we know it today, has been used extensively in video codecs, starting with H.261 and continuing up to H.264. The 1987 paper by Witten et al., "Arithmetic Coding for Data Compression,"[4] provided an alternative to the Huffman coding method. Their technique improved compression efficiency and has formed the basis of the CABAC encoding used in video coding standards including H.264, H.265, and VP9.

3.1.2 Prediction


Given the nature of video signals, efforts to represent video data using some form of prediction, in order to minimize the redundancies and thereby reduce the amount of transmitted data, began as early as the 1950s. In 1972, Manfred Schroeder of Bell Labs obtained a patent, "Transform Coding of Image Difference Signals,"[5] that explored several modern video codec concepts, including inter-frame prediction, transforms, and quantization of image signals. Schroeder's work also specifically mentions the application of Fourier, Hadamard, and other unitary matrix transforms that help to disperse the difference data homogeneously in the domain of the transformed variable.


Figure 16: Manfred Schroeder’s predictive coding techniques.


Image source: https://patents.google.com/patent/US3679821


While Schroeder’s patent referenced inter frame prediction, it didn't specifically involve the modern-day concept of motion compensated prediction. This was introduced by Netravali and Stuller in their 1981 patent, "Motion Estimation and Encoding of Video Signals in the Transform Domain,"[6] that described the techniques of motion estimation and motion compensation for predictive coding. In this work, the first encoding step is a linear transform technique like Hadamard Transform, followed by motion compensated prediction that is carried out in the transform domain. These principles have since formed the backbone of all modern-day video codecs.


Figure 17: Netravali and Stuller’s motion compensated prediction in the transform domain.


Source: https://patents.google.com/patent/US4245248A/

3.1.3 Transform Coding


The next important breakthrough was on achieving decorrelation of the source video (or image) pixels before processing and transmission. Prior work included references to various transform techniques to achieve this. In their 1974 paper, "Discrete Cosine Transform,"[7] Ahmed, Natarajan, and Rao introduced the DCT transform for image processing. As they explained, the DCT transform closely approximates the performance of the theoretically optimal Karhunen-Loeve Transform (KLT). The DCT and integer variants of the DCT have been adopted for transform coding in MPEG standards, including MPEG1, MPEG2, MPEG4-Part2, H.264, and H.265.
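
The energy-compaction property that made the DCT attractive can be seen in a small sketch (a textbook 1-D DCT-II, not the integer transform of any particular standard; the sample values are illustrative):

```python
import math

def dct_ii(x):
    """Orthonormal 1-D DCT-II of a list of samples."""
    N = len(x)
    out = []
    for k in range(N):
        s = sum(x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N))
        scale = math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
        out.append(scale * s)
    return out

smooth = [100, 101, 103, 104, 106, 107, 109, 110]   # a smooth luma row
coeffs = dct_ii(smooth)
# Nearly all of the signal energy lands in the first (DC) coefficient,
# leaving only small higher-frequency terms to quantize and code.
print([round(c, 1) for c in coeffs])
```

For this smooth input, the DC coefficient is close to 297 while every other coefficient is an order of magnitude smaller, which is exactly the decorrelation that quantization later exploits.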


Together, these developments and techniques have formed the pillars of the hybrid block-based video coding architecture used in today’s codecs. This basic video architecture has been improved upon to produce newer codecs that deliver improved compression efficiency. How and when these coding tools were used in the development of various video coding standards is presented in the following section.

3.2 Evolution of Video Coding Standards


Most video codecs are part of an international collection of standards and it is useful to learn about how these standards were formed and how they have evolved over time. A video coding standard is a document that describes the video bitstream format and an associated mechanism to decode the bitstream. It only defines the decoder. It leaves the encoder unspecified, allowing flexibility in encoder research and implementation to provide the compressed video in the prescribed format. Included in the decoder definition are also coding tools that can be used for creating the compliant bitstream. This document is usually approved and adopted by an international standards body like ISO and/or ITU-T. These bodies have done this successively to produce a host of MPEG video coding standards. The standardization process ensures that there is a fixed reference that can be used by designers and manufacturers of encoders and decoders to make the devices interoperable. This means encoders made by one manufacturer will produce bitstreams that can be decoded by decoders made by any other manufacturer. This also allows freedom for the consumer to be able to choose devices from different manufacturers, making the market highly competitive.


Over the years, the process of standardization has also become fairly standard. Once the requirements, including target applications and bitrates, are understood, a development process ensues where technical algorithm contributions are sought from individuals or corporations. When available, these are analyzed competitively for performance against a set of criteria and some are selected for finalization. The draft standard document is then generated. It includes these selected algorithms and techniques that, upon successful compliance testing, evolve into the final standard.


While ISO and ITU-T have standardized and published several video coding standards, including all the MPEG and H.26x video standards, a parallel mechanism, spearheaded by technology companies like Google and others, has produced other video standards like VP8 and VP9. This mechanism is similar to the one used by traditional standards bodies to develop and test new algorithms. However, the standard document and software are usually published by corporations like Microsoft and Google. The standard may or may not then be adopted and published by a standards body. (Microsoft’s VC-1 codec was adopted as a SMPTE standard.) Some of these video codecs, like VP8 and VP9, are also available as free open-source resources to the public. This mechanism has also evolved over the years, resulting in the formation of an alliance of major companies that contribute to the development and shared usage of the coding standard and resources. The following section provides insights on the timeline and development of popular video coding formats.

3.2.1 Timeline and Development


Two major standards bodies, namely, ISO/IEC and ITU-T, have, over the years, published the majority of video coding standards. The Motion Pictures Experts Group (MPEG) has been the working group for ISO/IEC standards and the Video Coding Experts Group (VCEG) has been the working group under the ITU-T. While the ISO/IEC produced the popular MPEG series of standards, ITU-T produced, through separate efforts, the competing H.26x series of standards. This arrangement, however, changed with the formation of the Joint Video Team (JVT), a collaboration between VCEG and MPEG that worked together and developed the H.264 or MPEG-4 Part-10 video coding standard. It was an important milestone as it brought together several coding tools from earlier H.26x and MPEG standards. The High Efficiency Video Coding (HEVC or H.265 or MPEG-H Part2) standard, that builds upon H.264, is also published by the Joint Collaborative Team on Video Coding (JCT-VC), a collaborative effort by the same groups.

3.2.1.1 MPEG Video Standards


MPEG1 Video was standardized in 1993 and was used for digital video storage on VCDs, with a bit rate target of 1.5 Mbps for VHS-quality video. It included support for CIF resolution (352x288 pixels) with YUV 4:2:0. This was extended by MPEG2 video, which was standardized in 1995 and was the first commercially successful codec for broadcast applications, targeting high bit rates of 3-20 Mbps and including HD video resolutions. MPEG2 revolutionized video transmission and storage and also pushed the evolution of digital television as we know it today. It was also used as the storage standard for DVDs, and it is still extensively in use today. MPEG4 Part 2, the next standard, was built on the same principles as MPEG2 with additional coding tools. It targeted low bit rate applications like web streaming and video calling, and it included compatibility with ITU-T's H.263 standard. It had limited success and was soon superseded by the next-generation MPEG4 Part 10, or H.264, standard.

3.2.1.2 H.26x Video Standards

In parallel with the development of the MPEG standards, the VCEG, under the ITU-T, published the H.26x standards, including H.261, H.262, and H.263. H.261 predated MPEG1 and was standardized in 1988. It supported CIF (352x288) and QCIF (176x144) resolutions with YUV 4:2:0 format for low latency video conferencing applications. H.262 was standardized as the MPEG2 video standard, establishing the collaboration between the two standards bodies. H.263 was standardized in 1996 and built upon H.261 to provide enhancements for video conferencing applications.

3.2.1.3 Joint Collaborative Standards

With the promise of reducing the bit rate by a further 50% over MPEG2 while maintaining the same video quality, the H.264 or AVC video standard was jointly developed in the early 2000s and standardized in 2003 by the JVT. It was hugely successful and has been deployed widely across a variety of applications, ranging from linear broadcasting and internet video streaming to storage applications like Blu-ray Discs. It has become the de facto codec of the internet. The success of H.264 led to the same model being used for the development of the H.265 or HEVC video standard, which was standardized in 2013. While HEVC retained the basic structure of H.264, it added significant improvements that resulted in a 50% reduction in bit rate while maintaining comparable video quality. This high coding efficiency, however, came at the expense of significantly higher algorithmic complexity for both encoder and decoder.

The next video codec, called VVC (Versatile Video Coding), is being developed by the Joint Video Experts Team (JVET) with a goal of decreasing the bit rate by a further 50% over H.265. It will provide substantial improvements for encoding 4K, 8K, and even higher resolutions, and will also target applications like 360-degree and high-dynamic-range (HDR) video. With the first draft expected by late 2019, the standard is expected to be ready somewhere in the timeframe of late 2020 to early 2021.

3.2.1.4 Open-Source Video Formats

While the MPEG and H.26x video standards have had the lion's share of development, implementation, and deployment thus far, other popular video formats have also risen to prominence over the years, especially as video transmission transitions from traditional methods to internet processing and delivery. These include Google's VP8, VP9, and the emerging AV1 standard. VP8 was originally developed by On2 Technologies, which was later acquired by Google; Google released it as an open, royalty-free codec in 2010. VP9, the successor to VP8, was developed by Google and expands on VP8's coding tools. It was primarily targeted at YouTube streaming but has since been adopted by other internet streaming platforms, including Netflix. Google has since also spearheaded the drive for royalty-free video coding formats by forming the Alliance for Open Media (AOM) along with several firms in the video industry, including Netflix, Mozilla, Cisco, and others. The alliance's first joint codec, AV1, was released in 2018 and, while largely based on VP9, also builds on Xiph/Mozilla's Daala, Cisco's Thor, and Google's own VP10 coding formats.

Table 5: Timelines of the evolution of video coding standards.

Year | Standard            | Applications
1988 | H.261               | Video conferencing
1992 | MPEG1               | Storage (VCDs)
1995 | H.262/MPEG2         | Storage (DVD) & broadcast transmission
1996 | H.263               | Low bitrate video conferencing
1999 | MPEG4-Part2         | Low bitrate applications like web streaming
2003 | H.264/MPEG4-Part 10 | Video streaming, storage and broadcast transmission
2008 | VP8                 | Internet video streaming
2013 | H.265/MPEG-H Part 2 | Video streaming, storage and broadcast transmission
2013 | VP9                 | Internet video streaming
2018 | AV1                 | Internet video streaming

Over the course of this evolution, every generation has built on top of the previous generation by introducing new toolsets that focus primarily on reduction of bit rates, reduction of decoder complexity, support for increased resolutions, newer technologies like multi-view coding, HDR, and improvements in error resilience, among other enhancements. Table 5 consolidates the details of the evolution of video coding standards.

3.2.2 Comparison of MPEG2, H.264, and H.265

Along with the timelines of development and target applications of video codecs, it’s also important to review the constituent tool sets that every generation of codec added to deliver compression efficiency. We will explore the evolution of coding toolsets by comparing three popular video coding standards.

This will provide a representative overview of this technological evolution. The details of various coding tools will be covered in subsequent chapters. Thus, even if the following terms may not mean much at this point, the summary in Table 6 can serve to provide a broad overview of the nuts and bolts that constitute every modern codec.

Table 6: Comparison of toolsets in modern video coding standards.

Coding Tool              | MPEG 2                                   | H.264                                              | H.265
Block size               | macroblock of 16x16                      | macroblock of 16x16                                | variable 8x8 to 64x64 CTUs
Partitioning             | 16x16                                    | variable partitions from 16x16 down to 4x4         | variable partitions from 64x64 down to 4x4
Transforms               | floating-point DCT-based 8x8 transforms  | 4x4 and 8x8 integer DCT transforms                 | variable 32x32 down to 4x4 integer DCT transforms + 4x4 integer DST transform
Intra prediction         | DC prediction in the transform domain    | spatial pixel prediction with 9 directions         | spatial pixel prediction with 35 directions
Sub-pixel interpolation  | ½ pixel bilinear filter                  | ½ pixel six-tap filter and ¼ pixel bilinear filter | ¼ pixel eight-tap filter (Y) and ⅛ pixel four-tap filter (UV)
Filtering                | no in-loop filtering                     | in-loop deblocking filter                          | in-loop deblocking and SAO filters
Entropy coding           | VLC                                      | CAVLC and CABAC                                    | CABAC
Block skip modes         | none                                     | direct modes                                       | merge modes
Motion vector prediction | spatial MV prediction from one neighbor  | spatial prediction using 3 neighboring MVs         | enhanced spatial and temporal prediction
Parallelism tools        | slices                                   | slices and slice groups                            | wavefront parallel processing, tiles, slices
Reference pictures       | 2 reference pictures                     | up to 16 depending on resolution                   | up to 16 depending on resolution
Interlaced coding        | field and frame coding supported         | field, frame, and MBAFF modes supported            | only frame coding supported

3.3 Summary
  • The core technologies that constitute modern compression architecture were developed over decades of research starting in the 1940s.
  • The three key breakthroughs that propelled video compression systems into their present form are a) information theory, b) prediction, and c) transforms.
  • Two major standards bodies, namely, ISO/IEC and ITU-T, have, over the years, published the majority of video coding standards. The Moving Picture Experts Group (MPEG) has been the working group for ISO/IEC standards and the Video Coding Experts Group (VCEG) has been the working group under the ITU-T.
  • MPEG and VCEG have collaborated to jointly produce immensely successful video standards, including MPEG2, H.264, and H.265.
  • Google published VP9 as an open video coding standard in 2013. Its success led to the formation of an alliance called AOM, which has developed a new standard called AV1.
  • The Joint Video Experts Team (JVET) is working on the successor to H.265, called VVC (Versatile Video Coding), with a goal of decreasing the bit rate by a further 50% over H.265.
3.4 Notes
  1. Richardson I, Bhat A. Video coding history Part 1. Vcodex. https://www.vcodex.com/video-coding-history-part-1/. Accessed September 21, 2018.
  2. Shannon CE. A mathematical theory of communication. Bell Syst Tech J. 1948;27(Jul):379-423;(Oct):623-656. https://goo.gl/dZbahv. Accessed September 21, 2018.
  3. Huffman DA. A method for the construction of minimum-redundancy codes. Proc IRE. 1952;40(9):1098-1101. https://goo.gl/eMYVd5. Accessed September 21, 2018.
  4. Witten IH, Neal RM, Cleary JG. Arithmetic coding for data compression. Commun ACM. 1987;30(6):520-540. https://goo.gl/gRrXaS. Accessed September 21, 2018.
  5. Schroeder MR. Transform coding of image difference signals. Patent US3679821A. 1972.
  6. Netravali AN, Stuller JA. Motion estimation and encoding of video signals in the transform domain. Patent US4245248A. 1981.
  7. Ahmed N, Natarajan T, Rao KR. Discrete cosine transform. IEEE Trans Comput. 1974;23(1):90-93. https://dl.acm.org/citation.cfm?id=1309385. Accessed September 21, 2018.

Part II

4 Video Codec Architecture

Video compression (or video coding) is the process of converting digital video into a format that takes up less capacity, thereby becoming efficient to store and transmit. As we have seen in Chapter 2, raw digital video needs a considerable number of bits and compression is essential for applications such as internet video streaming, digital television, video storage on Blu-ray and DVD disks, video chats, and conferencing applications like FaceTime and Skype.
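The storage and bandwidth burden of raw video can be made concrete with a quick back-of-the-envelope calculation. The resolution, frame rate, and helper name below are illustrative, not taken from the text:

```python
def raw_bitrate_bps(width, height, fps, bits_per_pixel=12):
    """Bit rate of uncompressed video, in bits per second.

    The default of 12 bits per pixel corresponds to 8-bit YUV 4:2:0:
    8 bits of luma per pixel plus chroma subsampled to a quarter of
    the luma samples (2 x 8 / 4 = 4 additional bits per pixel).
    """
    return width * height * bits_per_pixel * fps

# 1080p at 30 fps works out to roughly 746 Mbps uncompressed,
# far beyond what typical internet connections can carry.
print(raw_bitrate_bps(1920, 1080, 30) / 1e6, "Mbps")
```

Doubling the frame rate or stepping up to 4K multiplies this figure accordingly, which is why the compression ratios discussed later in the chapter are essential.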

Compressing video involves two complementary components. At the transmitting end, an encoder component converts the input uncompressed video to a compressed stream. At the receiving end, there is a decoder component that receives the compressed video and converts it back into an uncompressed format.

The word ‘codec’ is derived from the two words - encode and decode.

CODEC = Encode + Decode

Quite obviously, there are numerous ways in which video data can be compressed, and it therefore becomes important to standardize this process. Standardization ensures that video encoded from different sources using products of different manufacturers can be decoded uniformly across products and platforms provided by other manufacturers. For example, video encoded and transmitted using an iPhone needs to be viewable both on an iPhone and on a Samsung tablet. Streamed video from Netflix or YouTube needs to be viewable on a host of end devices. It needs no further emphasis that this interoperability is critical to mass adoption of the compression technology.

All modern video coding standards, including H.264, H.265, and VP9, define a bitstream syntax for the compressed video along with a process to decode this syntax to obtain a displayable video. This is referred to as the normative section of the video standard. The standard specifies all the coding tools that may be used to encode the video, along with restrictions on their use. The standard, however, does not specify a process to encode the video. While this provides immense research opportunity for individuals, universities, and companies to come up with the best possible encoding schemes, it also ensures that every encoded bitstream adhering to the standard can be completely decoded and will produce identical output from any compliant decoder. Figure 18, below, shows the encoding and decoding processes; the shaded portion in the decoder section highlights the normative part that is covered by video coding standards.

Figure 18: The process of encoding and decoding.

4.1 Hybrid Video Coding Architecture

As digital video is represented by a sequence of still images, it’s natural that video compression technologies employ frameworks to analyze and compress every picture individually. To do this, every frame is categorized either as an intra or an inter frame and different techniques are used to identify and eliminate the spatial and temporal redundancies to achieve efficient compression. In the first part of this section, we shall explore how individual pictures in the video sequence are categorized and grouped together for encoding. The latter half of this section then goes into how the frames themselves are broken down further for predictive block-based encoding.

4.1.1 Intra Frame Encoding

Intra frame encoding uses only the information in the current frame, analyzing and removing the spatial correlation amongst pixels to minimize the frame size. An intra frame, or I frame, is thus a self-contained frame that can be independently encoded and correspondingly decoded without any dependency on other frames. It's not uncommon to find I frames with a compression ratio around 1:10, meaning that an I frame is 10 times smaller than its uncompressed version. However, the actual compression can vary depending on the bit rate, the coding tools used, and the settings used to encode the video.

As I frames don't reduce temporal redundancies but only exploit pixel redundancies within the same frame, the drawback of using these frames is that they consume many more bits. On the other hand, owing to their lower compression ratio, they do not generate many artifacts and hence serve as excellent reference frames to encode subsequent, temporally predicted frames. Intra frame encoding is less computationally expensive than inter frame encoding and also doesn't require multiple reference frames to be stored in memory.

The first frame in a video sequence is encoded as an I frame. In live video transmission, these are used as starting points for newer clients (viewers) to join the stream. They are also used as points of resynchronization when the decoder encounters transmission errors in the bitstream. In compressed storage devices like DVD and Blu-Ray discs, I frames are used to implement random access points and trick modes like fast-forward and rewind.

4.1.2 Inter Frame Encoding

In a typical video sequence, the individual pictures are captured and played back at typical rates of either 25, 30 or up to 60 frames in one second. Unless the section of the visual sequence has a complete scene change or high motion, it’s likely that subsequent frames of the video are similar, with the same objects in more or less similar positions. This is true, for example, of a news or a chat show that has talking heads with static background. In such scenes, it’s more efficient to focus on and analyze the changes between the pictures rather than analyzing the actual pictures themselves. That’s what inter frame encoding is all about.

In addition to analyzing the current frame, inter frame compression also analyzes the information from neighboring frames and uses difference coding to remove the temporal, inter-picture redundancies to achieve compression. In difference coding, a frame is compared with an earlier encoded frame that is used as a reference frame. The difference between their pixel values, called residual, is then calculated. Only the residual is encoded in the bitstream. This ensures that only those pixels that have changed with respect to the reference frame are coded. In this way, the amount of data that is coded and sent is significantly reduced. Quite possibly, as noted earlier, the vast majority of the scene hardly changes between pictures (unless there is a scene change or significant motion). Thus, this method usually leads to a significant reduction in the number of bits needed to encode the frame.
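As a toy sketch of difference coding (the eight-pixel "frames" here are invented for illustration), only the residual against the reference is encoded, and the decoder reverses the process:

```python
# One row of pixels from a hypothetical reference frame and the
# corresponding row of the current frame; only two pixels changed.
reference = [52, 55, 61, 66, 70, 61, 64, 73]
current   = [52, 55, 61, 66, 72, 63, 64, 73]

# Encoder side: compute and encode only the residual.
residual = [c - r for c, r in zip(current, reference)]
print(residual)  # [0, 0, 0, 0, 2, 2, 0, 0] -- mostly zeros, which entropy-code cheaply

# Decoder side: add the residual back onto the reference
# to reconstruct the current frame exactly.
reconstructed = [r + d for r, d in zip(reference, residual)]
assert reconstructed == current
```

The long runs of zeros in the residual are what make this representation so much cheaper to encode than the raw pixel values.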

Inter frame encoding uses a variety of techniques like motion estimation and motion compensation to encode the changes from one frame to the next. Motion estimation is a process that analyzes different frames and provides motion vectors that are used to describe the motions of various objects across these frames. By incorporating this motion vector information in difference coding, coding efficiency is significantly improved, especially if the video scene contains several moving objects. The process of motion estimation is further explained in detail in chapter 6.
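As a preview of the motion estimation process, here is a minimal full-search block-matching sketch using the sum of absolute differences (SAD) as the matching cost. Real encoders use far more elaborate search strategies and sub-pixel refinement; the function names and the tiny frames below are our own illustration:

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def best_motion_vector(ref, cur_block, top, left, search_range):
    """Full search in the reference frame around the block's own
    position (top, left); returns the (dy, dx) with the lowest SAD."""
    n = len(cur_block)
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= len(ref) - n and 0 <= x <= len(ref[0]) - n:
                candidate = [row[x:x + n] for row in ref[y:y + n]]
                cost = sad(cur_block, candidate)
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv

# A bright 2x2 object sits at the top-left of the reference frame
# and has moved down-right by one pixel in the current frame.
ref = [[9, 9, 0, 0],
       [9, 9, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
cur_block = [[9, 9],
             [9, 9]]  # the block at (1, 1) in the current frame
print(best_motion_vector(ref, cur_block, top=1, left=1, search_range=1))
# (-1, -1): the best match lies one pixel up and to the left
```

The motion vector plus the (near-zero) residual of the matched block is all the encoder needs to transmit for this block.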

Inter frame encoding can employ the following two types of predictive coding.

4.1.2.1 Predictive Coded Frames (P Frames)

P frame (predictive inter frame) encoding uses frames that were encoded earlier in the video sequence as a reference and encodes the changes in pixel values of the current frame from the pixels in the reference frame. As illustrated in Figure 19, below, the current frame (P frame) and the reference frame (I frame) have similarities and thus prediction using the I frame as a reference helps in reducing the number of encoded bits.

The reference frames are usually I or P frames; however, newer codecs also use other frames, like B frames (bidirectionally predictive inter frames), for reference. While reliance on previous frames helps in reducing the number of bits used to encode the inter frames, it usually increases the sensitivity to transmission losses. As an example, if there's a transmission error that results in loss of bits in an I frame, a bitstream with inter frames that reference the I frame will show more visual artifacts than if the bitstream were encoded with only I frames.

A P frame is thus the term used to define forward prediction and consists of motion information and residual data that effectively determine the prediction.

4.1.2.2 Bidirectionally Predictive Coded Frames (B Frames)

B frames are similar to P frames, except that they reference and extract information from frames that are temporally before and later in the sequence. Thus, the difference between P frames and B frames lies in the type of reference frames they use.

Figure 19: Illustration of P frame and B frame encoding.

To understand why this is needed and also the benefits that this type of prediction offers, let us examine the sequence of frames shown in Figure 19. If the third frame in the sequence were to be encoded as a P frame that uses only forward prediction, it would lead to poorer prediction for the ball region, as the additional ball is not present in the preceding frames. However, compression efficiency can be improved for such regions by using backward prediction. This is because the ball region can be predicted from the future frames that have the fourth moving ball in them. Thus, as shown in the figure, the third frame could benefit from B frame compression by having the flexibility to choose either forward or backward prediction for the region containing the top three balls and any of the future frames for the last ball.

To be able to predict from future frames, the encoder has to, in an out-of-order fashion, encode the future frame before it can use it as a reference to encode the B frame. This requires an additional buffer in memory to temporarily store the B frame, pick and encode the future frame first, and then come back to encode the stored B frame. Because the frames are sent in the order in which they are encoded, in the encoded bitstream they appear out of order relative to the source stream. Correspondingly, the decoder decodes the future frame first, stores it in memory, and uses it to decode the B frame. The decoder will, however, display the B frame first, followed by the future reference frame. Consequently, as a B frame is based on a frame that will be displayed in the future, there will be a difference between the decode order and the display order of frames in the sequence whenever the decoder encounters a B frame.

Figure 20: Sequence of frames in display order as they appear in the input.

Figure 21: Sequence of frames in encode/decode order as they appear in the bitstream.

We can see this clearly in Figure 20, above. It shows the display order of frames in the video sequence, which is also the order in which the frames appear in the input. This display number, or frame number, is also indicated in the frames in the figure. In this example, I, P, and B frame encoding is used and there are five B frames for every P frame. Frame number 1 is encoded as an I frame, followed by a future frame (frame number 7) as a P frame, so that the five B frames in between have a future frame from which they can predict. After frame 7 is encoded, frames 2 to 6 are encoded. This pattern repeats periodically, with frame 13 encoded as a P frame ahead of frames 8 to 12, and so on.

The encoding (and corresponding decoding) order of the frames, as they appear in the bitstream for this specific example, is shown in Figure 21. During the decoding process, the P frames (frames 7, 13, 19) have to be decoded before the B frames that precede them in display order (frames 2-6, 8-12, 14-18) can be decoded. However, each P frame is held in a buffer and displayed only after the B frames are displayed.
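The reordering illustrated in Figures 20 and 21 can be sketched programmatically. This simplified model assumes the fixed pattern of the example, an opening I frame and five B frames per reference frame, and the function name is our own:

```python
def encode_order(num_frames, n_b=5):
    """Map 1-based display order to encode/decode order, assuming an
    opening I frame and n_b B frames between consecutive reference
    frames. Frames after the last complete group are omitted for
    simplicity."""
    order = [1]          # the I frame opens the sequence
    start = 2            # first frame of the next run of B frames
    while start + n_b <= num_frames:
        p = start + n_b                # the future reference (P) frame
        order.append(p)                # encoded before the B frames...
        order.extend(range(start, p))  # ...that reference it
        start = p + 1
    return order

print(encode_order(19))
# [1, 7, 2, 3, 4, 5, 6, 13, 8, 9, 10, 11, 12, 19, 14, 15, 16, 17, 18]
```

The printed sequence matches the bitstream order of Figure 21: each P frame jumps ahead of the B frames that display before it.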

Typically, B frames use far fewer coding bits than P frames and there are many B frames encoded for every P frame in the sequence. A B frame is thus the term used to define both forward prediction and backward prediction and consists of motion vectors and residual data that effectively describe the prediction. As with P frames, the reliance on the previous and future frames helps in further reducing the number of bits used to encode the B frames. However, it increases the sensitivity to transmission losses. Furthermore, the cumulative prediction and quantization (a term that will be explained in chapter 7) processes across successive frames increase the error between the original picture and the reconstructed picture. This is because quantization is a lossy process and prediction from a lossy version of the picture results in increased error between the original and reconstructed pictures. For this reason, earlier standards did not use B frames as reference frames. However, the enhancements in filtering and prediction tools have improved the prediction efficiency, thereby enabling the use of B frames as reference frames in newer standards like H.264, H.265, and VP9.

Figure 22: Illustration of frame size across different frame types.

The relative frame sizes for each of these frame types are illustrated in Figure 22. The figure plots the frame sizes of a sample file encoded with I, P, and B frame types using an I frame period of 50 frames. In this graph, the peaks are the sizes of the I frames that occur at every interval of 50 frames; these are shown in red. The next largest frames are the P frames, shown in blue, followed by the B frames, shown in green.

4.1.3 Group of Pictures (GOP) Structure

Every frame type has its role to play in an encoded sequence. While I frames consume more bits they serve as excellent references and access points in the bitstream. P and B frames leverage their predictions to crank up compression efficiency. Most video sequences have similar images for long periods of time. By strategically interspersing I, P and B frames periodically, such that there are many P and B frames between each I frame, it is possible to obtain dramatically higher compression, on the order of 1000:1, while maintaining acceptable visual quality levels.

The sequence of periodic and structured organization of I, P, and B frames in the encoded stream is called a group of pictures (GOP). A GOP starts with an I frame. This allows fast seek and random access through the sequence. Upon encountering a new GOP (I frame), the decoder understands that it doesn't need any information from previous frames for further decoding and resets its internal buffers. Decoding can thereby start cleanly at a GOP boundary and prediction errors from the previous GOP are corrected and not propagated further.

4.1.3.1 Open and Closed GOPs

For the last pictures in the GOP, the encoder also has the option to use frames from the subsequent GOP for bidirectional prediction. By doing so, better prediction and, hence, slightly better video quality can be achieved. The resulting GOP using this mechanism is called an open GOP. Alternatively, the encoder can choose not to rely on frames from subsequent GOPs and instead keep the prediction for all frames contained within the same GOP. This mechanism is called a closed GOP. As a closed GOP is self-contained without any neighboring GOP dependencies, it is useful for frame-accurate editing and also serves as a good splice point in the bitstream.

GOP intervals are expressed in seconds. A 1s GOP means the encoder inserts an I frame at every interval of 1 second. This means that if the frame rate of the video is 30fps (say 1080p30), the encoder inserts an I frame every 30 frames. However, if the video frame rate is 60fps, then it inserts an I frame every 60 frames. The internal GOP structure is often represented by two numbers, namely, M and N. The first number, M, refers to the distance between two reference P frames and the second number, N, refers to the actual GOP size. Modern codecs may use the count of B frames instead of M. As an example, M=5, N=15 indicates the GOP structure illustrated in Figure 23. Here, every 5th frame is a reference P frame and an I frame follows after every 15 frames.

I BBBB P BBBB P BBBB | I BBBB P BBBB P BBBB | I …

Figure 23: Illustration of GOP with M=5 and N=15.
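To make the M and N notation concrete, here is a minimal Python sketch (an illustration, not part of any standard) that expands the two numbers into the frame-type pattern of a single closed GOP:

```python
def gop_pattern(m, n):
    """Frame types for one closed GOP: M is the distance between
    reference (I/P) anchor frames, N is the GOP size in frames."""
    types = []
    for i in range(n):
        if i == 0:
            types.append("I")      # every GOP starts with an I frame
        elif i % m == 0:
            types.append("P")      # anchor reference frames
        else:
            types.append("B")      # bidirectionally predicted frames
    return "".join(types)

print(gop_pattern(5, 15))  # IBBBBPBBBBPBBBB, matching Figure 23
```

Running it with M=5, N=15 reproduces the layout of Figure 23.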

Modern encoders have great flexibility in choosing from among a host of schemes for reference frames, especially as B frames may also be used as references to code other (B or P) frames. This helps to significantly improve compression but is sensitive to error propagation. This means that if some data gets lost, the complex referencing structures will serve to propagate the errors introduced. Hierarchical B reference schemes (also called B-pyramid referencing), introduced in H.264 and also supported in H.265, provide very good compression efficiency and can also limit error propagation. The hierarchy that exists in referencing B frames helps to limit the number of pictures affected by data corruption. Let us explore a few typical GOP structures that modern encoders deploy.

1. IBBBBP Structure: This is the classic I-P-B GOP structure without any hierarchical B reference structures. The reference pictures, display order, and a sample encode/decode order for every frame are illustrated in Figure 24. The P picture with sequence number 4 is encoded after the first I picture. This is then followed by the three B pictures in sequence, and so on. It should be noted that B pictures are still used for referencing other B pictures, without any hierarchy.

Figure 24: P and B reference frames without hierarchical B prediction.

2. IBBBBBBBP Structure: This scheme uses 7 B frames, some of which are used as reference frames for other P and B frames, as shown in Figure 25, below. Frames 0 and 8 in the sequence or display order (the I and P frames) are encoded first, followed by frame 4 (B1). Frame 4 uses the I and P frames as references. B1 frames are the first level in the hierarchy and are also used as references for the next level of B frames, namely, B2. This scheme further extends hierarchically to frames in level B3 that use B frames in level B2 as reference.

Figure 25: Hierarchical B reference frames.
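The level-by-level encode order just described can be sketched for a dyadic GOP. This is a simplified illustration assuming midpoint splitting, not a normative ordering:

```python
def hierarchical_coding_order(gop_size):
    """Coding order for a dyadic hierarchical-B GOP: the I/P anchors
    first, then each B level in turn (B1, then B2, then B3)."""
    order = [0, gop_size]            # anchor frames, in display-order numbers
    intervals = [(0, gop_size)]
    while intervals:
        next_level = []
        for lo, hi in intervals:
            if hi - lo >= 2:
                mid = (lo + hi) // 2       # the B frame predicted from lo and hi
                order.append(mid)
                next_level += [(lo, mid), (mid, hi)]
        intervals = next_level
    return order

print(hierarchical_coding_order(8))  # [0, 8, 4, 2, 6, 1, 3, 5, 7]
```

For a GOP of 8 this yields exactly the order described above: frames 0 and 8, then frame 4 (B1), then frames 2 and 6 (B2), then the B3 frames 1, 3, 5, 7.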

4.2 Block-Based Prediction

Modern codecs like H.265 and VP9 employ a hybrid, block-based prediction architecture. This is the encoder design that we will deal with throughout this book. The block diagram of such a hybrid video encoder, adapted from Sullivan, Ohm, Han, and Wiegand [1], is illustrated in Figure 26. The encoder uses a raw video sequence as input to create a standard-compliant encoded bitstream.

In this model, each picture of the video sequence is categorized as an intra or inter picture for the purposes of prediction. The first picture in the sequence is marked as intra. This picture uses only intra prediction and thereby has no coding dependency on any other pictures. Because intra pictures consume more bits compared to inter pictures, they are used sparingly. The periodic interval at which they are inserted into the bitstream is called the I frame interval. As we know, this also provides a random-access point into the video sequence. Furthermore, the encoder usually has built-in scene change detection algorithms and inserts an I frame at the picture where a scene change occurs. Other pictures are coded as inter frames. These use temporal prediction from neighboring pictures and blocks.

Figure 26: Block diagram of a block-based encoder.

The first step in every block-based encoder is to split every frame into block-shaped regions. These regions are known by different names in different standards. The H.264 standard uses 16x16 blocks called macroblocks, VP9 uses 64x64 blocks called superblocks and H.265 uses a variety of square block sizes called coding tree units (CTUs) that can range from 64x64 to 16x16 pixels. With the increased need for higher resolutions over the years, the standards have evolved to support larger block sizes for better compression efficiency. The next generation codec, AV1, also supports block sizes of 128x128 pixels. Every block in turn is usually processed in raster order within the frame in the encoding pipeline.

These blocks are further broken down in a recursive fashion and the resulting sub-blocks are then processed independently for prediction. Figure 27 shows how the recursive partitioning is implemented in VP9. Each 64x64 superblock is broken down in any of four modes, namely, the 64x32 horizontal split, the 32x64 vertical split, the 32x32 horizontal-and-vertical split mode, or the no-split mode.

Recursive splitting is permitted in the 32x32 horizontal-and-vertical split mode. In this mode, each of the 32x32 blocks can again be broken down into any of the four modes, and this continues down to the smallest partition size of 4x4. This type of splitting is also called a quadtree split. Figure 27 also illustrates how a 64x64 superblock in VP9 is broken down in a recursive manner into different partition sizes.

Figure 27: A 64x64 superblock in VP9 is partitioned recursively into sub-partitions.
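The recursive quadtree split can be sketched as follows; `split_decision` stands in for whatever criterion a real encoder uses and is a hypothetical callback for illustration only. For brevity the sketch models only the no-split versus four-way quadtree path, omitting the rectangular horizontal and vertical splits:

```python
def quadtree_partition(x, y, size, split_decision, min_size=4):
    """Return the leaf partitions (x, y, size) of a block, splitting a
    block into four quadrants whenever split_decision approves and the
    minimum partition size has not been reached."""
    if size > min_size and split_decision(x, y, size):
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += quadtree_partition(x + dx, y + dy, half,
                                             split_decision, min_size)
        return leaves
    return [(x, y, size)]

# Split everything larger than 16x16: a 64x64 superblock yields 16 leaves.
leaves = quadtree_partition(0, 0, 64, lambda x, y, s: s > 16)
print(len(leaves))  # 16
```

With a decision function that never splits, the superblock stays whole; with one that always splits, the recursion bottoms out at 4x4.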

At the highest level, the superblock is broken down using split mode into four 32x32 blocks. The first 32x32 block is in none mode. This is not broken down again. The second 32x32 block has horizontal mode. This is split into two partitions of 32x16 pixels each (indicated as 2 and 3). The third 32x32 block is in split mode; hence, it is again recursively broken into four 16x16 blocks. These are further broken down all the way to 8x8 and then to 4x4 blocks.

This mechanism of recursive partition split is also illustrated in Figure 28. The partition scheme in HEVC is very similar with a few minor variations. Now that we understand the recursive block sub partitioning schema, let us explore why this kind of partition is needed and how it helps. Coding standards like HEVC and VP9 address a variety of video resolutions from mobile (e.g., 320x240 pixels) to UHD (3840x2160 pixels) and beyond. Video scenes are complex and different areas in the same picture can be similar to other neighboring areas. If the scene has a lot of detail with different objects or texture, it’s likely that smaller blocks of pixels are similar to other, smaller, neighboring blocks or also to other corresponding blocks in neighboring pictures.

Figure 28: Recursive partitioning from 64x64 blocks down to 4x4 blocks.

Thus, we see in Figure 29 that the area with the fine details of the leaves has smaller partitions. The inter-pixel dependencies in the smaller partition areas can be exploited better using prediction at the sub-block level to get better compression efficiency. This benefit, however, comes with the increased cost of signaling the partition modes in the bitstream. Flat areas, on the other hand, like the darker backgrounds with few details, will have good prediction even with larger partitions.

Figure 29: Partition of a picture into blocks and sub-partition of blocks.

Analyzer source: https://www.twoorioles.com/vp9-analyzer/

The next question that comes to mind is, how are these partitions determined? The challenge to any encoder is to use the partition that best enables encoding of the pixels using the fewest bits and yields the best visual quality. This is usually an algorithm that is unique to every encoder. The encoder evaluates various partition combinations against set criteria and picks one that it expects will require the fewest encoding bits. The information on what partitioning is done for the superblock or CTU block is also signaled in the bitstream.
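One common way encoders make this choice is a rate-distortion style comparison: each candidate's cost combines its distortion with an estimate of the bits it would need, and a split is kept only if the children beat the parent. A toy sketch follows, with made-up numbers and a made-up Lagrange multiplier λ, purely for illustration:

```python
def rd_cost(distortion, bits, lam):
    """Lagrangian cost combining distortion with an estimate of bits."""
    return distortion + lam * bits

def prefer_split(parent_cost, child_costs, split_signal_bits, lam):
    """Keep a split only if the four children plus the cost of
    signaling the split are cheaper than coding the parent whole."""
    return sum(child_costs) + lam * split_signal_bits < parent_cost

# Parent: distortion 900 at 10 bits; children cost 840 total plus
# 4 bits of split signaling -> 880 < 1000, so the split wins here.
print(prefer_split(rd_cost(900.0, 10, 10.0),
                   [180.0, 220.0, 200.0, 240.0], 4, 10.0))  # True
```

The same comparison runs recursively at every level of the quadtree, which is how the final partition map of a superblock emerges.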

Once the partitions are determined, the following steps are performed in sequence in every encoder. First, every block undergoes a prediction process to remove correlation among pixels in the block. The prediction can be either within the same picture or across several pictures. This involves finding the best matching prediction block, whose pixel values are subtracted from the current block pixels to derive the residuals. If the best prediction block is from the same picture as the current block, then it is classified as using intra prediction.

Otherwise, the block is classified as an inter prediction block. Inter prediction uses motion information, which is a combination of motion vector (MV) and its corresponding reference picture. This motion information and the selected prediction mode data are transmitted in the bitstream. As explained earlier, blocks are partitioned in a recursive manner for the best prediction candidates. Prediction parameters like prediction mode used, reference frames chosen, and motion vectors can be specified, usually for each 8x8 block within the superblock.

The difference between the original block and the resulting prediction block, called the residual block, then undergoes transform processing using a spatial transform. The transform is a process that takes in the block of residual values and produces a more efficient representation. The transformed pixels are said to be in the transform domain, and the coefficients are concentrated around the top left corner of the block, with values decreasing as we traverse the block horizontally rightward and vertically downward. Up to this point, the entire process is lossless. This means that, given the transformed block, the original set of pixels can be generated by using an inverse transform and reverse prediction.
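The energy-compaction behavior can be seen with a plain floating-point DCT-II. This is for illustration only; the actual standards specify integer approximations of such transforms:

```python
import math

def dct2(block):
    """Naive 2-D DCT-II of a square residual block."""
    n = len(block)
    alpha = lambda k: math.sqrt((1 if k == 0 else 2) / n)
    return [[alpha(u) * alpha(v) * sum(
                block[i][j]
                * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                * math.cos((2 * j + 1) * v * math.pi / (2 * n))
                for i in range(n) for j in range(n))
             for v in range(n)] for u in range(n)]

# A flat residual block compacts into a single top-left (DC) coefficient.
coeffs = dct2([[10, 10, 10, 10]] * 4)
print(round(coeffs[0][0], 6))  # 40.0; all other coefficients are ~0
```

A flat block of identical residuals ends up represented by one coefficient, which is precisely why the subsequent quantization and scanning stages are so effective.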

The transform block then undergoes a process called quantization that involves dividing the block values by a fixed number to reduce the number of residual coefficients. These can then be efficiently arranged using a scanning process and then encoded to produce a binary bitstream using an entropy coding scheme.
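In its simplest form, quantization is just a division and rounding per coefficient, and dequantization multiplies back, which is why the process is lossy. The sketch below assumes a single uniform step size, whereas real codecs derive step sizes from a quantization parameter:

```python
def quantize(coeffs, qstep):
    """Divide each transform coefficient by the step size and round."""
    return [round(c / qstep) for c in coeffs]

def dequantize(levels, qstep):
    """Reverse the scaling; the rounding error is lost for good."""
    return [level * qstep for level in levels]

coeffs = [40.0, 6.2, -3.9, 0.4, 0.2]
levels = quantize(coeffs, 4)      # [10, 2, -1, 0, 0] -- small values vanish
recon = dequantize(levels, 4)     # [40, 8, -4, 0, 0] -- an approximation
```

Note how the two smallest coefficients quantize to zero. Long runs of zeros are exactly what the scanning and entropy coding stages exploit.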

When the decoder receives this video bitstream, it carries out the complementary processes of entropy decoding, de-scanning, de-quantization, inverse transform and inverse prediction to produce the decoded raw video sequence. When B frames are present in the stream, the decoding order (that is, bitstream order) of pictures is different from the output order (that is, display order). When this happens, the decoder has to buffer the pictures in its internal memory until they can be displayed.

It should be noted that the encoder does not employ the input source material for its prediction process. Instead, it has a built-in decoder processing loop. This is needed so that the encoder can produce a prediction result that is identical to that of the decoder, given that the decoder has access only to the reconstructed pixels derived from the encoded material. To this end, the encoder performs inverse scaling and inverse transform to duplicate the residual signal that the decoder would arrive at. The residual is then added to the prediction and loop-filtered to arrive at the final reconstructed picture. This is stored in the decoded picture buffer for subsequent prediction. This exactly matches the process and output of the decoder and prevents any pixel value drift between the encoder and the decoder.
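The reconstruction loop shared by encoder and decoder can be sketched in one dimension. The transform is omitted for brevity; only the quantize, dequantize, and add-prediction path is shown:

```python
def reconstruct(orig, pred, qstep):
    """Encoder-side decoder loop: code the residual, then rebuild the
    block exactly as the decoder will, so both sides stay in sync."""
    residual = [o - p for o, p in zip(orig, pred)]
    levels = [round(r / qstep) for r in residual]         # into the bitstream
    recon_residual = [level * qstep for level in levels]  # what the decoder recovers
    recon = [p + r for p, r in zip(pred, recon_residual)]
    return levels, recon   # recon feeds the decoded picture buffer

levels, recon = reconstruct([100, 102, 98, 101], [100, 100, 100, 100], 1)
print(recon)  # [100, 102, 98, 101] -- lossless when qstep is 1
```

The key design point is that `recon`, not `orig`, is stored for future prediction; any larger quantization step makes `recon` an approximation, but it is the same approximation on both ends, so no drift accumulates.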

4.3 Slices and Tiles

We have seen in earlier sections how a frame can be split into contiguous square blocks of pixels called superblocks or CTUs for processing. Modern codecs also employ slices and tiles. These split the frame into regions containing one or more superblocks or CTUs. In H.264 and H.265, slices and tiles are used to split the frame into multiple units that are independently processed either in raster order (slices) or non-raster order (tiles) so that the encode/decode of these independent units can happen in parallel. This serves to speed up the computations in parallel architectures and multi-threaded environments. Tiles are also supported in the VP9 and AV1 standards and are a computation-friendly toolset, especially for software (CPU-based) encoder and decoder implementations.

The slice toolset is available in H.264 and H.265. Slices split every picture into independent entities containing blocks of the frame in raster order. In Figure 30, the frame is split into three slices, each containing several CTU blocks, and every slice can be independently encoded. The CTUs are encoded in raster order within each slice. This also allows the slices to be decoded independently, with the exception of the deblocking filtering operation, which is permitted across slice boundaries prior to reconstruction of the pixels for further prediction.

Figure 30: A video frame is split into three slices.

VP9 supports tiles and, when implemented, the picture is broken along superblock boundaries. Each tile contains multiple superblocks that are all processed in raster order and flexible ordering is not permitted. However, the tiles themselves can be in any order. This means that the ordering of the superblocks in the picture is dependent on the tile layout. This is illustrated in Figure 31 below.

Figure 31: Splitting a frame into 4 independent column tiles.

It should be noted that tiles and slices are parallelism features intended to speed up processing. They are not quality improvement functions. This means that, in order to achieve parallel operations, some operations like predictions, context sharing, and so on would not be permitted across slices or tiles. This is to facilitate independent processing. Such limitations may also lead to some reduction in compression efficiency. As an example, VP9 imposes the restriction that there can be no coding dependencies across column tiles. This means that two column tiles can be independently coded and hence decoded. For example, in a frame split into 4 vertical tiles, as shown in Figure 31, above, there will be no coding dependencies like motion vector prediction across the vertical tiles. Software decoders can therefore use four independent threads, one each to decode one full column tile. The tile size information is transmitted in the picture header for every tile except the last one. Decoder implementations can use this information to skip ahead and start decoding the next tile in a separate thread. Encoder implementations can use the same approach and process superblocks in parallel.
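Because column tiles have no coding dependencies between them, a decoder can hand each tile to its own worker. A thread-pool sketch follows, in which the tile payloads and the `decode_tile` body are placeholders:

```python
from concurrent.futures import ThreadPoolExecutor

def decode_tile(tile_id, payload):
    """Placeholder for decoding one independent column tile."""
    return (tile_id, len(payload))   # a real decoder would return pixels

# Tile sizes come from the picture header, letting the decoder skip
# ahead to each tile's start and decode all four columns in parallel.
payloads = [b"\x00" * n for n in (100, 80, 120, 90)]
with ThreadPoolExecutor(max_workers=4) as pool:
    decoded = list(pool.map(decode_tile, range(4), payloads))
print(decoded)  # [(0, 100), (1, 80), (2, 120), (3, 90)]
```

An encoder can take the same approach in reverse, encoding the tiles in parallel and concatenating the resulting tile bitstreams with their sizes in the header.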

4.4 Interlaced versus Progressive Scan

Earlier, when video was represented using analog technologies, it was captured and displayed by scanning alternate rows at two different times to enhance motion perception. Each of these sets of alternate rows was called a field and the resulting video was called interlaced video. This method of representing video has become less common in recent times. This is driven by the fact that digital displays today largely support progressive formats. With the widespread use of video over the internet, newer standards like H.265 and VP9 expect progressive content. There is no explicit coding tool in the recent standards to support interlaced coding. Video streaming and delivery over the internet is all progressive, further lessening the need to support legacy interlaced coding. In this book, we will also focus exclusively on progressive scanned coding.

The coding tools and frameworks described above help to achieve the best compression efficiency. This is the goal of any video compression standard. However, standards also have provisions for other goals like computational ease and parallelism using tool sets like tiles and slices. Broadly speaking, designers of video coding standards keep the following goals in mind.

Compression efficiency
Efficient parallel implementation, especially for decoding
Error resilience
Transport layer integration

In the following chapters, we will explore in detail the various elements of the codec design that help in achieving each of the above goals.

4.5 Summary
● Video compression involves an encoder component that converts input uncompressed video into a compressed stream. After transmission or storage, this stream is received by a complementary decoder component that converts it back to an uncompressed format.
● Video compression techniques employ frameworks to analyze and compress every picture, categorizing each as an intra frame or an inter frame. These identify and remove spatial and temporal redundancies, respectively.
● Inter frames provide significant compression through the use of unidirectionally predicted frames (P frames) and bidirectionally predicted frames (B frames).
● Encoders use a periodic and structured organization of I, P and B frames in the encoded stream. This is called a group of pictures, or GOP.
● I frames are used in the stream to allow fast seek and random access through the sequence.
● Video coding standards use a block-based prediction model in which every frame is split into block-shaped regions of various sizes (e.g., 16x16 macroblocks in H.264, CTUs of up to 64x64 in H.265, or 64x64 superblocks in VP9). These blocks are further broken down into partitions in a recursive manner for prediction.
● Encoders use slices and tiles to split a frame into multiple processing units so that the encode/decode of these independent units can happen in parallel. These speed up computations in parallel architectures and multi-threaded environments.
4.6 Notes
  1. Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1649-1668. https://ieeexplore.ieee.org/document/6316136/?part=1. Accessed September 21, 2018.

5 Intra Prediction

The term intra frame coding means that compression operations like prediction and transforms are done using data only within the current frame, and not other frames in the video stream.

Every block in an intra frame is thus coded using intra-only blocks. Every frame or block that is intra coded is temporally independent and can be fully decoded without any dependency on other pictures. Intra-only blocks that use only spatial prediction can also be present in inter frames along with other inter blocks that use temporal prediction. In this chapter we will explore how spatial prediction is performed using different coding tools. Toward the end of the chapter, we will also compare how these tools differ across video standards.

5.1 The Prediction Process

While earlier standards like MPEG2 and MPEG4-Part 2 did not employ spatial pixel intra prediction, this has become a critical step in intra coding in standards since H.264. In this book, we will refer only to spatial pixel intra prediction. This mechanism of prediction exploits the correlation among neighboring pixels by using the reconstructed pixels within the frame to derive predicted values through extrapolation from already coded pixels. The predicted pixels are then subtracted from the current pixels to get residuals that can be efficiently coded. The objective of intra prediction, therefore, is to find the best set of predicted pixels to minimize the residual information. It uses the pixels from its immediately preceding spatial neighbors and derives the predicted pixels by filling whole blocks with pixels extrapolated from neighboring top rows or left columns of pixels.

Figure 32: Illustration of neighbor blocks used for intra prediction.

It should be noted that the prediction block is derived using blocks that are previously encoded and reconstructed (before filtering). As blocks in a frame are usually processed in left to right and top to bottom raster fashion, the top and left blocks of the current block are already encoded and hence these available pixels can be leveraged to predict the current block. The left and top pixel sets are double the current block's height and width, respectively, and different codecs limit the usage of pixels from this set for prediction. For example, H.265 uses all the left and bottom left pixels whereas VP9 allows the use of only the left set of pixels.

The concept of using neighboring block pixels for intra prediction is illustrated in Figure 32. The shaded blocks in the figure are all causal. This means that they have already been processed and coded (in scan order) and can be used for prediction. For current block C to be coded, the immediate neighbor blocks are left (L), top (T), top left (TL) and top right (TR). The bottom half of the figure also shows the pixel values for a sample 4x4 luma block and the corresponding luma 4x4 neighboring blocks. The prediction involves the samples from these neighboring blocks that are closest to the current block. The highlighting emphasizes that these samples are very similar and have almost the same values as the luma samples from the current 4x4 block. Intra prediction takes advantage of exactly this pixel redundancy by finding the best of these neighboring pixels that can be used for optimal prediction. This enables use of the fewest bits. Now that we know what intra prediction does, let us explore through an example how it is accomplished.

Example: Let us use the circumstances illustrated in Figure 32 as an example to understand, from an encoder's perspective, which of these pixels best help to predict the current pixels in the given 4x4 block. Let's try the left set of pixels, meaning the right-most column of the left neighbor L. In this case, every pixel from this column is duplicated horizontally, as shown in Figure 33(a). Another option is to use the top set of pixels. In this case, this would be the bottom-most row of the top neighbor block T. In this scenario, every pixel from this row is duplicated vertically, as shown in Figure 33(b). Projections of pixels along other angles are also possible, and an example is shown in Figure 33(c), where the pixels are projected from the bottom-most row of the top (T) and top right (TR) neighbor blocks, along a 45-degree diagonal.

We have seen in the above example a few possible prediction candidates. Every encoding standard defines its own list of permissible prediction candidates or prediction modes. The challenge for the encoder now is to choose one for every block. The process of intra prediction in the encoder therefore involves iterating through all the possible neighbor blocks and prediction modes allowed by the standard to identify the best prediction mode and pixels for minimizing the number of resulting residual bits that will be encoded.

Figure 33: Intra prediction from neighbor blocks by using different directional modes: (a) horizontal, (b) vertical, (c) diagonal.
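The modes of Figure 33 reduce to simple fill rules. Sketches of the two axis-aligned modes plus a DC average follow; these are illustrative only and ignore each standard's exact edge handling and smoothing filters:

```python
def predict_vertical(top_row):
    """Vertical mode: every row repeats the reconstructed pixels above."""
    return [list(top_row) for _ in top_row]

def predict_horizontal(left_col):
    """Horizontal mode: every column repeats the pixels to the left."""
    return [[p] * len(left_col) for p in left_col]

def predict_dc(top_row, left_col):
    """DC mode: fill the block with the mean of all neighbor pixels."""
    n = len(top_row)
    dc = (sum(top_row) + sum(left_col)) // (2 * n)
    return [[dc] * n for _ in range(n)]

print(predict_vertical([1, 2, 3, 4])[2])           # [1, 2, 3, 4]
print(predict_dc([4, 4, 4, 4], [8, 8, 8, 8])[0])   # [6, 6, 6, 6]
```

Angular modes generalize the same idea, projecting the neighbor pixels into the block along the chosen direction instead of straight down or across.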

To do this, the encoder could, for each of the prediction modes, use a distortion criterion like the minimal sum of absolute differences (min-SAD). This involves a simple computation and indicates the energy contained in the residuals. By computing the SAD for all the modes, the encoder can pick the prediction block that has the least SAD. The SAD, while providing a distortion metric, does not quantify anything about the resulting residual bits if the mode were chosen. To overcome this limitation, modern encoders compute a bit estimate and use it to derive a cost function. This is a combination of the distortion and the bit estimate. The best prediction mode is the one with the minimal cost function.

In the example, as illustrated in Figure 33, the residual blocks contain much smaller numbers. These are cheaper to represent than the original pixels. The SADs of the residual blocks from the three directional modes are 3, 1 and 7, respectively, and the encoder in this case could pick the vertical prediction mode as the best mode if it were using a min-SAD criterion.
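The min-SAD selection just described can be sketched directly; the candidate predictions below are hypothetical stand-ins for the blocks produced by the directional modes:

```python
def sad(block, pred):
    """Sum of absolute differences between a block and its prediction."""
    return sum(abs(a - b)
               for row_b, row_p in zip(block, pred)
               for a, b in zip(row_b, row_p))

def best_intra_mode(block, candidates):
    """Pick the mode whose prediction has the least SAD."""
    return min(candidates, key=lambda mode: sad(block, candidates[mode]))

block = [[10, 10], [12, 12]]
candidates = {
    "horizontal": [[10, 10], [12, 12]],   # SAD 0
    "vertical":   [[10, 12], [10, 12]],   # SAD 4
}
print(best_intra_mode(block, candidates))  # horizontal
```

Swapping the `sad` call for a Lagrangian cost of the form distortion plus λ times estimated bits turns this same loop into the rate-distortion mode decision used by modern encoders.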

The number of such allowable directional prediction modes and block sizes for intra prediction differs across codecs. More modes amount to an increase in encoder complexity but better compression efficiency. For example, H.264 allows every 4x4 or 8x8 block within a 16x16 macroblock to select a mode from the nine defined intra modes, and it also offers one mode for a 16x16 luma block. In VP9, ten prediction modes are defined. These are calculated for the 4x4 block.

Figure 34: Intra prediction angular modes defined in H.265.

H.265, on the other hand, extends this further to use 35 modes for prediction of 4x4, 8x8, 16x16 or 32x32 sub-blocks within the CTU. The prediction directions of the 35 modes are shown in Figure 34. The increased angles in H.265 are designed such that they provide more options for near-horizontal and near-vertical angles and fewer options for near-diagonal angles, in accord with statistical observations. The upcoming AV1 codec has up to 65 different modes. H.265 and AV1 therefore deliver better prediction and thereby improved video quality. However, they also have significantly increased computational complexity relative to other standards like H.264.

The most common prediction modes available across all codecs are directional. This includes horizontal, vertical and other angular directions, a DC predictor and any other standard-specific specialized modes (e.g., VP9 has a specialized mode called True Motion, or TM mode). Typically, the number of directional modes varies widely across standards, as we have discussed above.

While the exact number of pixels and the number of prediction modes differ slightly across codecs, conceptually they are all the same, as explained above.


When the intra prediction modes are established, the following are then sent as part of the encoded bit stream for every intra predicted block:

a) The prediction mode
b) The residual values (the difference between the current and predicted pixels)

 

 


The decoder performs a complementary operation when it receives the bitstream. It uses the intra prediction mode to look at the already decoded pixels and form the predicted pixels. It then adds this to the residuals from the bitstream to arrive at the pixels of the current block.

5.2 Transform Blocks and Intra Prediction


Figure 35: Original raw source.


Intra prediction for a block that's encoded in intra mode is done by successively performing prediction on all the smaller blocks of pixels within it, for example, on each of the sixty-four 8x8 blocks within a 64x64 superblock. The prediction process is similar in H.264/H.265 and VP9 and is done for every transform block. This means that, if the transform block size is 8x8 and the block partition is 16x16, there will be one prediction mode for the entire partition but intra prediction will be done for every 8x8 block.


Figure 36: Image formed from intra predicted pixels.


However, if the transform block size is 32x32 and the partition size is 16x16 (this is permitted in H.265), there will be only one intra prediction mode shared by the entire 32x32 block. It should be noted that, as transform sizes are square, intra prediction operations are always square.


Some standards, like VP9, allow separate intra prediction for luma and chroma, in which case the process can be repeated for chroma. H.264 and H.265, however, use the same luma intra prediction mode for chroma. Figures 35-37 illustrate the process and efficiency of spatial intra prediction for a complete intra frame.


Figure 35 shows the raw source frame. Figure 36 shows the image formed by the predicted pixels. Notice how close the predicted pixels are to the original source in Figure 35. Finally, Figure 37 shows the residual values and by looking at this image, we can infer the accuracy of the prediction.


It can be observed that the prediction is quite accurate even at a block level in flat areas like the sky, whereas in places of detail like the building windows and the trees, the blocks used in prediction are unable to completely capture these details. This results in prediction errors that will eventually be encoded in the bitstream as residual values.


Figure 37: Residual image formed by subtracting original and predicted pixel values.

5.3 Comparison Across Codecs


Table 7: Comparison of intra prediction across codecs.

Feature          | H.264                                      | H.265                                                                         | VP9
Transforms       | 4x4 integer DCT transforms                 | variable 32x32 down to 4x4 integer DCT transforms + 4x4 integer DST transform | variable 32x32 down to 4x4 integer DCT transforms + 4x4 integer DST transform
Intra prediction | spatial pixel prediction with 9 directions | spatial pixel prediction with 35 directions                                   | 8 angles for spatial directional prediction


As highlighted earlier, different video coding standards have similar intra prediction principles but differ in the following fine details:

● Transform sizes
● The number and types of intra prediction modes
● Variations in handling luma and chroma pixels


Table 7 highlights the differences in the above characteristics that define the intra prediction process in a few video standards. We notice that more angles are used in newer codecs to derive increased prediction efficiency at the expense of increased computational complexity.

5.4 Summary
● The term intra coding means that compression operations like prediction and transforms are done using only data within the current frame, not other frames in the video stream.
● Intra prediction exploits the correlation between reconstructed neighboring pixels within a frame by extrapolating prediction values from already encoded pixels.
● The reconstructed pixels from the last column of the left neighboring block and the last row of the top neighboring block are typically used to predict the pixels of the current block.
● The encoder chooses how to predict the pixels by selecting from several directional modes and block sizes. More modes lead to increased encoder complexity but offer better compression efficiency.
● The prediction mode and residual values for every intra predicted block are signaled in the encoded bitstream.
● One intra prediction mode is chosen for each partition, but the prediction process is done for every transform block. For example, if the transform size is 8x8 and the partition is 16x16, there will be one prediction mode for the 16x16 partition but intra prediction will be performed for every 8x8 block.

 

 

6 Inter Prediction



Inter frame coding implies that the compression operations like prediction and transforms are done using data from the current frame and its neighboring frames in the video stream. Every frame or block that is inter coded is dependent on temporally neighboring frames (called reference frames). It can be fully decoded only after the frames on which it depends are decoded. Unlike in intra prediction, where pre-defined modes in the standard define from which directions blocks may be used for prediction, inter prediction has no defined modes. Thus, the encoder has the flexibility to search a wide area in one or several reference frames to derive a best prediction match. The partition sizes and shapes within a superblock are also flexible such that every sub-partition could have its best matching prediction block spanning different reference frames. In intra prediction, a row or column of pixels from the neighboring blocks are duplicated to form the prediction block. In inter prediction, there is no extrapolation mechanism. Instead, the entire block of pixels from the reference frame that corresponds to the best match forms the prediction block.

6.1 Motion-Based Prediction


As different objects in the scene can move at different speeds, independently of the frame rate, their actual motion displacements are not necessarily in integer pel units. This means that limiting the search for a block match in the reference frames using pixel granularity can lead to imperfect prediction results. Searching using sub-pixel granularity could give better matching prediction blocks, thereby improving compression efficiency. The natural question, then, is how to derive these sub-pixels, given that they don't actually exist in the reference frames? The encoder will have to use a smart interpolation algorithm to derive these sub-pel values from the neighboring full-pel integer samples. The details of this interpolation process are presented in later sections of this chapter. For now, we assume that the encoder uses such an algorithm and searches for matching prediction blocks with sub-pixel accuracy by interpolating pixel values between the corresponding integer-pixels in the reference frame. If it still turns out that good temporal matching blocks are unavailable or intra prediction yields better results, the block is coded as intra.


Figure 38: Deriving the motion vector using motion search.


The process of searching the reference frames to come up with the best matching prediction block is called motion estimation (ME). The spatial displacement of the current block from its prediction block is called motion vector (MV) and is expressed in (X, Y) pixel coordinates. Figure 38 illustrates this concept. The motion search for the sample block that contains a ball is shown. In the reference frame, among the other blocks in the defined search area, the highlighted block containing the ball is chosen and the relative distance between the current and the chosen block is tracked as the motion vector.
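The block-matching idea in Figure 38 can be sketched as an exhaustive SAD search over a window of candidate displacements. This is a simplified illustration: the function name and frame layout are assumptions, and practical encoders use the faster strategies discussed in section 6.2.

```python
import numpy as np

def motion_search(cur_block, ref, bx, by, search_range):
    """Exhaustive block matching: minimize SAD over +/- search_range.

    cur_block: NxN block from the current frame, located at (bx, by).
    ref:       reference frame (2-D array of pixels).
    Returns the best (dx, dy) motion vector and its SAD cost.
    """
    n = cur_block.shape[0]
    best, best_sad = (0, 0), float('inf')
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = bx + dx, by + dy
            if x < 0 or y < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate block falls outside the reference frame
            cand = ref[y:y + n, x:x + n]
            sad = int(np.abs(cur_block.astype(int) - cand.astype(int)).sum())
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best, best_sad
```

The winning (dx, dy) is exactly the motion vector of Figure 38; the leftover SAD is what the residual coding stage must pay for.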


The mechanism to use motion vectors to form the prediction block is called motion compensation (MC). It is one of the most computationally intensive blocks in an encoder.


The prediction block derived using motion search is not always identical to the current block. The encoder, therefore, calculates the residual difference by subtracting the prediction block pixel values from those of the current block. This, along with the motion vector information, is then encoded in the bitstream. The idea is that a better search algorithm results in a residual block that needs minimal bits, yielding better compression efficiency. If the ME algorithm can't find a good match, the residual error will be significant. In this case, other possible options evaluated by the encoder could include intra prediction of the block or even encoding the raw pixels of the block. Also, as inter prediction relies on reconstructed pixels from reference frames, encoding consecutive frames using reference frames that have previously been encoded using inter prediction often results in residual error propagation and gradual reduction in quality. To counter this, an intra frame is inserted at intervals in the bitstream; these reset the inter prediction process and gracefully maintain video quality.


Once the motion vector is derived, it needs to be signaled in the bitstream and this process of encoding a motion vector for each partition block can take a significant number of bits. As motion vectors for neighboring blocks are often correlated, the motion vector for the current block can be predicted from the MVs of nearby, previously coded blocks. Thus, using another algorithm, which is often a part of the standard specification, the differential MVs are computed and signaled in the bitstream along with every block.


When the decoder receives the bitstream, it can then use the differential motion vector and the neighboring MV predictors to calculate the absolute value of the MV for the block. Using this MV, it can build the prediction block and then add to it the residuals from the bitstream to recreate the pixels of the block.


Figure 39: Frame 56 of input - stockholm 720p YUV sequence.


Figure 40: Motion compensated prediction frame.


Figures 39-42 illustrate step-by-step how a video frame is encoded and decoded using block-based inter prediction. Figure 39 shows the input clip. Its motion compensated prediction frame is shown in Figure 40, alongside the corresponding motion vectors in Figure 41. Figure 42 shows the residual frame, obtained as the difference between the original frame and its prediction. We notice from Figure 40 how visually close the prediction frame is to the actual frame being encoded. This is also objectively represented in Figure 42: it is mostly gray, representing areas of strong similarity between the predicted and actual pixels.


Figure 41: Motion vectors from reference frames.


Figure 42: Motion compensated residual frame [1].

6.1.1 Motion Compensated Prediction


As we’ve seen in earlier sections, shifted areas in the reference frames are used for prediction of blocks in the current frame. Using the process of motion estimation, a displacement motion vector is derived. It corresponds to the motion shift between the reference and current frame for the block in question.

6.1.1.1 Bidirectional Prediction


Figure 43: Illustration of bidirectional prediction.


Motion compensated prediction can be performed for every block using either one frame or two frames as reference. Thus, a block could potentially have multiple sources from which it can predict, and this number depends on whether the current block belongs to a P frame or a B frame. As shown in Figure 43, in the case of P frames, only one prediction reference frame is allowed for every block, whereas in the case of B frames, every block can predict from up to two reference frames. Encoders like H.264 and H.265 maintain two separate reference frame lists (List L0 and List L1), where any block of a P frame can predict using reference frames from List L0. A block from a B frame can predict using either L0 or L1 or both, where one reference frame from each of L0 and L1 can be used.
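Conceptually, once the two motion-compensated blocks have been fetched from the L0 and L1 references, the bidirectional prediction is formed by combining them. The sketch below uses plain rounded averaging; it is an illustration only, since the standards define their own exact weighting and rounding rules (including the weighted prediction of the next section).

```python
import numpy as np

def bi_predict(pred_l0, pred_l1):
    """Combine the two motion-compensated blocks (one per reference list)
    by averaging with round-half-up, staying in 8-bit pixel range."""
    return ((pred_l0.astype(int) + pred_l1.astype(int) + 1) >> 1).astype(np.uint8)
```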


In using reference lists, it’s entirely up to the encoder implementation how these lists are managed, including how many pictures are added in these lists and what pictures get added to each list (as long as they are within the profile/limits set by the standard). The same picture can be added to both lists and, interestingly, this could be used to simulate motion vectors with higher ⅛ pixel precision (without actual bitstream signaling). This can be done for any block by using two motion vectors that are a quarter pixel apart and point to the same reference picture.

6.1.1.2 Weighted Prediction


Block-based inter prediction is sensitive to illumination variations between frames, especially quick changes in illumination caused by flashes fired during live events or by fade-ins and fade-outs. These can cause significant variation in intensity across immediately successive frames, which may result in poor motion estimation and compensation. Fades are illustrated in Figure 44. We can observe that the intensity varies consistently but dramatically across successive frames.


Weighted prediction (WP) is a tool available in H.264 and H.265. It helps to overcome the challenges due to quick illumination changes by applying a weighting factor and offset during the process of prediction from the reference frame. By using WP, the pixels in the reference frame are first scaled using a multiplication factor, W, and then shifted by an additive offset, O.


During the process of motion compensation, the Sum of Absolute Differences (SAD) between the current frame fcurr and the reference frame fref can be mathematically expressed as:

SADWP = Σ | fcurr − Fref |, where Fref = W · fref + O and the sum is taken over all pixels in the block
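A small sketch of this computation follows. It assumes the weight W and offset O are already known; deriving good values for them is the hard part, as discussed below.

```python
import numpy as np

def weighted_sad(cur, ref, w, o):
    """SAD after compensating the reference block for illumination change:
    SAD_WP = sum | f_cur - (w * f_ref + o) |"""
    f_ref = w * ref.astype(float) + o
    return float(np.abs(cur.astype(float) - f_ref).sum())
```

For a fade where the current frame is uniformly dimmer than the reference, the correct W drives the weighted SAD toward zero even though the plain SAD is large, so motion estimation again finds the right match.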


Figure 44: Fades in video sequences.


The challenge here is to derive the correct values of W and O; once they are found, however, the SAD can be correspondingly compensated and the ME process can accurately find the correct matching prediction block. It should be noted that different situations warrant slightly different approaches. Scenes with fade-in and fade-out have global brightness variations across the whole picture. These can be compensated for with frame-level WP parameters. However, scenes with camera flash will have local illumination variations within the picture, which obviously require more localized WP parameters for efficient compensation. This is also true in scenes with fade-in from white, where the brightness variation of blocks with lighter pixels is smaller than that of blocks with darker pixels. Localized WP parameters, however, would introduce excessive overhead bits in the encoded bitstream and are not available in H.264 and H.265. To combat this, various approaches have been developed that effectively use the available structures of multiple reference frames with different WP parameters to compensate for non-uniform brightness variations.

6.2 Motion Estimation Algorithms


In differential coding, the prediction error and the number of bits used for encoding depend on how effectively the current frame can be predicted from the previous frame. As we discussed in the earlier sections, if there’s motion involved, the prediction for the moving parts in the sequence involves motion estimation (ME); that is, finding out the position in the reference frame from where it has been moved.


Figure 45: Block-based motion estimation.


Figure 45 illustrates the block-based ME process. In this process, the best match is the block that gives the minimum sum of absolute error between the two blocks. As various objects in a frame could move to any position in the next frame, depending on their velocities, searching the entire set of reference frames for the suitable predictor would be very computationally intensive. Typically, between successive frames, the motion of a particular object would be restricted to a few pixels in both the vertical and horizontal directions. Hence, motion estimation algorithms define a rectangular area enclosing the current block called the search area to conduct the search. The search area is usually specified in terms of a search range. This gives the horizontal and vertical number of pixels to search for the predictor block. Figure 46 shows a typical search region around a 64x64 block with a search range of +/-128 pixels.


Figure 46: Search area with range +/- 128 pixels around a 64x64 block.


It is expected that the best motion search match will be found at one of the points within this search range. However, it should be noted that this is not guaranteed. Sometimes the motion can be quite considerable and fall outside this range as well; for example, in a fast-moving sports scene. In such cases, the best match could be the motion point in the search area that is closest to the actual motion vector. Intra prediction can also be used if there's no suitable motion vector match.

Depending on how and what motion points are searched in the motion estimation process, the search algorithms can be classified as follows. Full search or exhaustive search algorithms employ a brute-force approach in which all the points in the entire search range are visited. The SAD or a similar metric is calculated at each of these points and the one with the minimum SAD is adjudged the predictor. This is the best of all the search techniques, as every point in the search range is evaluated meticulously. The downside of exhaustive search is its excessive computational complexity.


Figure 47: Three step search for motion estimation.


Smart search algorithms try to overcome the heavy computational load imposed by exhaustive search by not evaluating all the candidate points. Instead, they use specific search patterns to dramatically reduce the number of search points needed to find the motion vector. Many fast search algorithms have been developed that differ primarily in the pattern in which the search points are evaluated. Examples are 2-D logarithmic search (LOGS), three-step search (TSS), diamond search (DS), and so on. In this section, we shall walk through a simple three-step search algorithm, illustrated in Figure 47, where a sample search area of +/- 7 pixels is used.


The search in this algorithm is performed in three separate steps, each progressing from a coarse-grained to a finer-grained search area. Every dot here represents a search point. The black dots represent the points visited in the first step. These are widely spaced at a distance of 4 pixels apart. Nine such points at square locations are searched and the one with the minimum cost is selected as the best point of this step. The search continues by keeping this best point identified in step 1 as the center and evaluating the 8 square points around it. These are only 2 pixels apart, as shown by the dark gray dots. The best point in this second step is then used as the center for the final search of 8 square points around it. These points are only 1 pixel away from each other. The best point of the third step is then chosen as the final integer-pixel MV. This method provides good speedup by using only 25 search points, compared to the 225 points that exhaustive search would evaluate over the same +/- 7 search area. As the search range becomes larger, larger square patterns can be used and this mechanism provides higher speedup ratios.
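The three-step procedure above can be sketched as follows. Names are hypothetical and the sketch is a simplified illustration: it starts from a step of 4 (the +/- 7 case of Figure 47) and simply skips candidates that would fall outside the reference frame.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return int(np.abs(a.astype(int) - b.astype(int)).sum())

def three_step_search(cur, ref, bx, by, step=4):
    """Coarse-to-fine search: 9 points per step, halving the spacing."""
    n = cur.shape[0]
    cx, cy = 0, 0                                  # current best displacement
    best = sad(cur, ref[by:by + n, bx:bx + n])     # cost at the (0, 0) center
    while step >= 1:
        best_dx, best_dy = 0, 0
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                x, y = bx + cx + dx, by + cy + dy
                if x < 0 or y < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                    continue
                s = sad(cur, ref[y:y + n, x:x + n])
                if s < best:
                    best, best_dx, best_dy = s, dx, dy
        cx, cy = cx + best_dx, cy + best_dy        # recenter on the winner
        step //= 2                                 # 4 -> 2 -> 1
    return (cx, cy), best
```

Each pass visits at most 9 points, so the whole search touches about 25 points instead of the full window.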

6.3 Sub-Pixel Interpolation


As explained earlier, sub-pixel precision in motion vectors is needed as different objects in the video scene can have motion at speeds that are independent of the frame rate. This cannot be accurately captured using full pixel motion vectors alone. In this section, let us see how such sub-pixel accurate motion vectors are calculated.


Figure 48: Example of integer and sub-pixel prediction.


Luma and chroma pixels are not sampled at sub-pixel positions. Thus, pixels at these precisions don't exist in the reference picture. Block matching algorithms therefore have to create them using interpolation from the nearest integer pixels and the accuracy of interpolation depends on the number of integer pixels and the filter weights that are used in the interpolation process. Sub-pixel motion estimation and compensation is found to provide significantly better compression performance than integer-pixel compensation and ¼-pixel is better than ½-pixel accuracy. While sub-pixel MVs require more bits to encode compared to integer-pixel MVs, this cost is usually offset by more accurate prediction and, hence, fewer residual bits.


Figure 48 illustrates how a 4x4 block could be predicted from the reference frame in two scenarios: integer-pixel accurate and fractional-pixel accurate MVs. In Figure 48a, the grey dots represent the current 4x4 block. When the MV is integral (1,1) as shown in Figure 48b, it points to the pixels corresponding to the black dots that are readily available in the reference frame. Hence, no interpolation computations are needed in this case. When the MV is fractional (0.75, 0.5), as shown in Figure 48c, it has to point to pixel locations as represented by the smaller gray dots. Unfortunately, these values are not part of the reference frame and have to be computed using interpolation from the neighboring pixels.
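As a deliberately simplified illustration of creating pixels "between" the integer samples, the sketch below builds horizontal half-pixel samples with a 2-tap bilinear average. Real codecs use the longer 6- to 8-tap filters of the following section, which preserve detail much better; the function name here is illustrative.

```python
import numpy as np

def half_pel_bilinear(frame):
    """Horizontal half-pixel plane via 2-tap averaging of integer neighbors.
    Output column j sits midway between input columns j and j+1."""
    left = frame[:, :-1].astype(int)
    right = frame[:, 1:].astype(int)
    return ((left + right + 1) >> 1).astype(np.uint8)   # round half up
```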


H.265 uses the same MVs for luma and chroma and uses ¼-pixel accurate MVs for luma, computed using eight-tap (half-pel) and seven-tap (quarter-pel) interpolation filters. For YUV 4:2:0, these MVs are scaled accordingly for chroma as ⅛-pixel accurate values. VP9 uses a similar interpolation with eight-tap filters and also offers a more accurate ⅛-pixel interpolation mode. In VP9, the luma half-pixel samples are generated first, interpolated from neighboring integer-pixel samples using an eight-tap weighting filter. This means that each half-pixel sample is a weighted sum of the 8 neighboring integer pixels used by the filter. Once half-pixel interpolation is complete, quarter-pixel interpolation is performed using both half and full-pixel samples.

6.3.1 Sub-Pixel Interpolation in HEVC


In this section, the fractional interpolation process is illustrated in detail using the HEVC interpolation filters as an example. The integer and fractional pixel positions used in HEVC, as adapted from Sullivan, Ohm, Han, & Wiegand, [1] are illustrated in Figure 49. The positions labeled Ai,j represent the luma integer pixel positions, whereas ai,j, bi,j and so on are the ½-pixel and ¼-pixel positions that will be derived by interpolation. In HEVC, the half-pixel values are computed using an eight-tap filter. However, the quarter-pixel values are computed using a seven-tap filter. Let us now illustrate how all the 15 positions marked a0,0 to r0,0 are computed.            


Figure 49: Pixel positions for luma Interpolation in HEVC.


The filter coefficient values are given in Table 8, below. An example for sample b0,j in half sample position and a0,j in quarter sample position is given below.

b0,j = ( -A-3,j + 4·A-2,j - 11·A-1,j + 40·A0,j + 40·A1,j - 11·A2,j + 4·A3,j - A4,j ) >> 6
a0,j = ( -A-3,j + 4·A-2,j - 10·A-1,j + 58·A0,j + 17·A1,j - 5·A2,j + A3,j ) >> 6


Table 8: Interpolation filter coefficients used in HEVC.

Index |  -3 |  -2 |  -1 |   0 |   1 |   2 |  3 |  4
HF    |  -1 |   4 | -11 |  40 |  40 | -11 |  4 | -1
QF    |  -1 |   4 | -10 |  58 |  17 |  -5 |  1 |

 

 


The samples labeled e0,0 to o0,0 can then be derived by applying the same filters to the above computed samples as follows:

e0,0 = ( -a0,-3 + 4·a0,-2 - 10·a0,-1 + 58·a0,0 + 17·a0,1 - 5·a0,2 + a0,3 ) >> 6
and similarly for f0,0 through r0,0, applying the HF or QF filter vertically to the corresponding column of a, b or c samples.


At this point, weighted prediction can also be applied, if enabled. The prediction values computed above are scaled and offset using the WP weight and offset that are signaled in the bitstream.
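The one-dimensional filtering defined by the Table 8 coefficients can be sketched as follows. This is an illustrative, normalized single-pass form with a hypothetical function name; HEVC itself carries higher intermediate precision between the horizontal and vertical filtering passes and applies the final rounding only once.

```python
# Table 8 coefficients: half-pel (HF, 8 taps) and quarter-pel (QF, 7 taps).
HF = [-1, 4, -11, 40, 40, -11, 4, -1]
QF = [-1, 4, -10, 58, 17, -5, 1]

def interp_sample(row, i, taps):
    """Filter a 1-D run of integer samples around position i.

    For HF the taps cover A[i-3..i+4]; for QF, A[i-3..i+3]. The weighted
    sum is rounded and divided by the filter gain of 64 (a >> 6 shift),
    mirroring the normalization in the equations above.
    """
    vals = [row[i - 3 + k] for k in range(len(taps))]
    acc = sum(c * v for c, v in zip(taps, vals))
    return (acc + 32) >> 6
```

Note that both coefficient sets sum to 64, so a flat region of pixels passes through the filter unchanged.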


Table 9: Chroma interpolation filter coefficients used in HEVC.



The fractional sample interpolation process for the chroma components is similar to that for luma. However, a different interpolation filter with only 4 filter taps is used. It should be noted that the fractional accuracy is ⅛-pixel for subsampled chroma in the 4:2:0 case. The four-tap filters for the eighth-sample positions (1/8 through 7/8) for 4:2:0 chroma are as given in Table 9, above.

6.4 Motion Vector Prediction


Figure 50: Motion vectors of neighboring blocks are highly correlated [2].


Encoding absolute values of motion vectors for each partition can consume significant bits. The smaller the partitions chosen, the greater this overhead. The overhead can also be significant in low bit rate scenarios. Fortunately, as highlighted in Figure 50, the motion vectors of neighboring blocks are usually similar. This correlation can be leveraged to reduce bits by signaling only a differential motion vector, obtained by subtracting a predicted motion vector from the motion vector of the block. The predicted vector, PMV, is first formed from the neighboring motion vectors. The DMV, the difference between the current MV and the predicted MV, is then encoded in the bitstream.
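The signaling arithmetic itself is simple. The sketch below pairs it with an H.264-style component-wise median of three neighbor MVs as one concrete, non-normative way to form the PMV; function names are illustrative, and each standard defines its own candidate derivation.

```python
def median_pmv(left, top, topright):
    """Predictor MV as the component-wise median of three neighbor MVs."""
    med = lambda a, b, c: sorted((a, b, c))[1]
    return (med(left[0], top[0], topright[0]),
            med(left[1], top[1], topright[1]))

def encode_mv(mv, pmv):
    """Encoder side: signal only the delta between the MV and its predictor."""
    return (mv[0] - pmv[0], mv[1] - pmv[1])

def decode_mv(dmv, pmv):
    """Decoder side: reverse the subtraction, MV = PMV + DMV."""
    return (pmv[0] + dmv[0], pmv[1] + dmv[1])
```

When the prediction is good, the DMV components cluster around zero and entropy-code very cheaply.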


The question now is, which neighboring MV is most suitable for prediction of any block? Different standards allow different mechanisms to derive the PMV. It usually depends on the block partition size and on the availability of neighboring MVs. Both HEVC and VP9 have an enhanced motion vector prediction approach. That is, MVs of several spatially and temporally neighboring blocks that have been coded earlier are candidates that are evaluated for selection as the best PMV candidate. In VP9, up to 8 motion vectors from both spatial and temporal neighbors are searched to arrive at 2 candidates. The first candidate uses spatial neighbors, whereas the second candidate list consists of temporal neighbors. VP9 specifically prefers to use candidates using the same reference picture and searches this picture first. However, candidates from different references are also evaluated if the earlier search fails to yield enough candidates. If there still aren't enough predictor MVs, then 0,0 vectors are inferred and used.


Once these motion vector predictors (PMVs) are obtained, they are used to signal the DMV in the bitstream using any of the four modes available in VP9. Three of the four modes correspond to direct or merge modes. In these modes, no motion vector differential needs to be sent in the bitstream. Based on the signaled inter mode, the decoder simply infers the predictor MV and uses it as the block's MV. These modes are as follows.

● NEARESTMV uses the first predictor as-is, with no delta.
● NEARMV uses the second predictor with no delta.
● ZEROMV uses 0,0 as the MV.

In the fourth mode, called the NEWMV mode, the DMV is explicitly sent in the bitstream. The decoder reads this motion vector difference and adds it to the predictor motion vector to compute the actual motion vector.

● NEWMV uses the first predictor in the candidate list and adds a delta MV to it to derive the final MV. The delta MV is coded in the bitstream.

H.265 also uses similar mechanisms as above with slightly different terminologies and candidate selection process. In H.265, the following modes are allowed for any CTU.

Merge Mode. This is similar to the first three modes of PMV in VP9, where no DMV is sent in the bitstream and the decoder infers the motion information for the block using the set of PMV candidates. The algorithm for arriving at the specific PMV for every block is specified in the standard.

Advanced Motion Vector Prediction. Unlike the merge mode, in this mode the DMV is also explicitly signaled in the bitstream. This is then added to the PMV (derived using a process similar to the above for merge mode) to derive the MV for the block.

Skip Mode. This is a unique mode that is used when there is motion of objects without any significant change in illumination. While earlier standards defined skip mode to be used in a perfectly static scenario with zero motion, newer codecs like H.265 defined it to include motion. In H.265, a skip mode syntax flag is signaled in the bitstream and, if enabled, the decoder uses the corresponding PMV candidate as the motion vector and the corresponding pixels in the reference frame as is, without adding any residuals.

6.5 Summary
● In inter coding, operations such as prediction and transforms are performed using data from the current frame and neighboring frames in the video stream.
● The process of searching a reference frame to derive the best-matching prediction block is called motion estimation (ME), and the mechanism of forming the prediction block using motion vectors (MVs) is called motion compensation (MC).
● Sub-pixel precision is needed for motion vectors because different objects in a video scene can move at speeds independent of the frame rate and therefore cannot be represented using full-pixel MVs.
● To overcome the signaling overhead of motion vectors, the MVs of highly similar neighboring blocks are used to predict the block's MV, and only the resulting residual MV is signaled in the bitstream.
6.6 Notes
  1. Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High Efficiency Video Coding (HEVC) standard. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1649-1668. https://ieeexplore.ieee.org/document/6316136/?part=1. Accessed September 21, 2018.
  2. VP9 Analyzer. Two Orioles. https://www.twoorioles.com/vp9-analyzer/. Accessed September 22, 2018.
Residual Coding

Image

After prediction is completed, the block of residual values undergoes a series of processes before it becomes bits and bytes in the encoded bitstream. These processes include the transform and quantization stages, and they are covered in detail in this chapter. The transform stage takes in the block of residual values after prediction and converts it to a different domain called the frequency domain. It's the same set of values, but represented differently in the frequency domain.

7.1 What Is Frequency?

Simply put, frequency refers to the rate at which something is repeated over a particular period of time. The more it’s repeated, the higher is the frequency and vice versa. Frequency is thus the inverse of the time period of the change. This means that the shorter the time it takes for the change from one value to another and back, the higher is the frequency of occurrence of that value over the time period. In the case of pictures, pixel values vary in intensities and the time it takes to change from one intensity to another and back again is represented by frequency. The faster the change of intensity from, say, light to dark and back, the higher the frequency needed to represent that part of the picture.

In other words, frequency in a picture is nothing but a representation of the rate of change. Rapidly changing parts of the picture (e.g., edges) contain high frequencies and slowly changing parts (e.g., solid colors and backgrounds) contain low frequencies. Let us look at an example. Say the block of image pixels is black. It doesn't display any change, meaning it has an infinite interval of change and thus low frequency. Now, however, if the block of image pixels is black at the left, turns white in the center and then turns black again, it then has one change in the interval until it’s back to its original value. This means it has a finite frequency of, say, one. If the block has two such changes, then it has a higher frequency of two and so on. Thus, the greater the number of changes in the values in the period in question, the greater is the frequency. This is illustrated in Figure 51 for an image from the into tree clip, below. The sky area has low variations in pixel values and, correspondingly, low frequency components. In contrast, the trees have significant texture, hence higher variations in pixel intensities and high frequency components.

Image

Figure 51: Illustration of high frequency and low frequency areas in an image.

7.2 How Is an Image Decomposed into Its Frequencies?

Within every block of pixels in the image, we can approximate the individual column and row of pixels as the sum of a series of frequencies, starting with the lowest frequency and adding more frequencies. The block of pixels is thus a juxtaposition of a series of frequencies. The lowest frequency, in effect the DC or average value of pixels in the block, doesn't add any fine details at all. With every frequency added, one after another, more and more details are built up in the picture.

The basic idea here is that a complex signal like an image pixel block can be broken down into a linear weighted sum of its constituent frequency components. Higher frequency components represent more details.
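This decomposition can be demonstrated with a short sketch using a pure-Python, orthonormal 1-D DCT (the pixel row below is an illustrative example, not from the book's figures): reconstructing from only the first k frequency components shows the detail building up as components are added.

```python
import math

# A sketch of frequency decomposition using a pure-Python, orthonormal
# 1-D DCT-II. The pixel row below is illustrative (an edge going from
# dark to bright), not taken from the book's figures.
N = 8
row = [50, 50, 52, 60, 120, 200, 202, 200]

def basis(u, x):
    # u-th cosine basis function sampled at position x
    c = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
    return c * math.cos((2 * x + 1) * u * math.pi / (2 * N))

# Forward transform: project the row onto each basis function.
coeffs = [sum(basis(u, x) * row[x] for x in range(N)) for u in range(N)]

def reconstruct(k):
    # Rebuild the row using only the k lowest-frequency components.
    return [sum(basis(u, x) * coeffs[u] for u in range(k)) for x in range(N)]

for k in (1, 3, 8):
    print(k, [round(v) for v in reconstruct(k)])
# k = 1 yields only the flat DC average; k = 8 returns the row exactly.
```

With k = 1 every sample collapses to the block average (the DC value); each added component restores more of the edge until, with all 8, the row is recovered exactly.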

7.2.1 Why Move to the Frequency Domain?

The pixel blocks in the image have components of varying frequencies, and the transform process serves to represent them as a linear weighted sum of their constituent frequency components. The sections of high detail, like edges, correspond to high frequency components. The flat areas correspond to low frequency components, as described in the previous section. Splitting the image in this way affords us certain advantages, as follows.

7.2.1.1 Energy Compaction

Blocks of video picture samples exhibit strong spatial correlation. This also extends to residuals, as illustrated in Figure 52 for a sample 32x32 block. We see here that the residuals are not only similar but also small in value, and can therefore be efficiently represented. Similarity and correlation correspond to flat transitions rather than dramatic changes. What this means is that the energy in the pixels is usually concentrated in the low frequency components relative to the high frequency components. This concentration of energy is called energy compaction and it is a critical reason why transforms are needed. With energy compaction, we are able to group these components in the order of increasing frequency and obtain a pattern wherein the higher energy, low frequency components occur at the start and gradually taper to the lower energy, high frequency components. This allows us a very efficient representation of the block of samples with much lower values.

Figure 53 shows the 32x32 block of Figure 52 after it is transformed using a discrete cosine transform, which is described in a later section. From this figure, we clearly see how the transformed 32x32 block exhibits energy concentration in the low frequency components; that is, around the top left of the block after the transformation. As we move away from the top left area, we see many small values and redundant zeros that can be efficiently represented after a suitable reordering process. The details of how reordering is done are covered in subsequent sections. For now, it suffices to note that the reordering process helps in efficiently representing the transformed values by making use of the energy patterns.

Image

Figure 52: Residual sample values for a 32x32 block.

Image

Figure 53: Energy compaction after transforms.

7.2.1.2 Compression by Discarding Data

Once we do the transform that results in energy compaction, we have the ability to analyze this data set and go one step further to selectively discard some of the data. As the HVS is more sensitive to uniform areas than to more detailed areas, we can focus on the low frequency components and selectively discard the highest frequency components. Though discarding the high frequency components means losing some of the detail, this process is found to produce a good approximation of the original block at the expense of some level of smoothing. The amount of data kept and discarded depends on the level of detail desired and the level of compression required. Furthermore, the HVS is more sensitive to luma information than to chroma. This means we can be slightly more aggressive with this process for chroma than for luma. The process of discarding the data is called quantization and is covered in detail in upcoming sections.

7.2.2 Transform Selection Criteria

When selecting transforms, the following are key criteria.

7.2.2.1 Energy Compaction

The transform should provide a complete decorrelation of the data and maximum energy compaction. We have explained this in the previous section.

7.2.2.2 The Transform Should Be Invertible

This means when the decoder receives these transformed coefficients, it should be able to do an inverse transform operation to retrieve the input samples accurately. It should be noted that this process is also used by the encoder as it performs this sub-section of the decoder tasks to store the reconstructed pixels internally.

7.2.2.3 The Transform Should Be Easy to Implement

This is especially true of the inverse transform process that is implemented by the decoder. As a variety of decoding platforms with a host of capabilities exist, it's useful for the video to be decodable by a maximum number of decoders with a minimal complexity requirement. Common requirements include a minimal memory storage requirement, lower arithmetic precision for storage of internal computation results, and fewer arithmetic operations.

7.2.3 The Discrete Cosine Transform

Now that we have seen what transforms are, why they are needed, and the criteria used for their selection, let us explore some popular ones. Over the years, several transforms have been proposed and used in different video and image coding standards, including the popular discrete cosine transform (DCT), discrete sine transform (DST) and Hadamard transform. These transforms operate on a block of image or residual values and fit very well in the block-based video compression framework. In this section, we will focus on the landmark DCT that has been widely employed since the era of the MPEG2 video standard, thanks to its simplicity and high energy compaction. The concepts explained are fundamental and similar across the various transforms.

DCT expresses the input signal as a sum of sinusoids of different frequencies and amplitudes. It is similar to the discrete Fourier transform but uses only cosine functions and real numbers. In most video standards, a block of residual values is transformed using a 4x4 or 8x8 integer transform that is an approximation of the DCT, and this will be the focus of this book.

Image

The above is the equation of the two-dimensional DCT. While it looks complicated, the equation can be understood easily as follows: any two-dimensional signal f can be expressed as F, its transformed counterpart, where F is the weighted sum of the component values of f with cosine basis functions. The equation is two-dimensional, as it uses two-dimensional cosine basis functions corresponding to u and v. The value N is chosen corresponding to the size of the residual matrix. In the above equation, when N = 4 it yields the 4x4 matrix of 2D cosine transform coefficients as:

Image

The DCT of a set of 4x4 samples X is given by the expression below, where Y is the transformed block, A is the DCT basis set matrix, and ATr is its matrix transpose.

Y = A X ATr
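As a sketch of this expression, the 4x4 basis matrix A can be built directly from the DCT equation and applied to an illustrative residual block; because A is orthonormal, X = ATr Y A recovers the input exactly, which is the invertibility criterion discussed above. The residual values here are illustrative, not taken from the book's figures.

```python
import math

# A sketch of the block transform Y = A X ATr for N = 4: build the
# orthonormal DCT basis matrix A from the equation above, transform an
# illustrative residual block X, and verify invertibility via X = ATr Y A.
N = 4

def dct_matrix(n):
    rows = []
    for u in range(n):
        c = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([c * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                     for x in range(n)])
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

A = dct_matrix(N)
X = [[5, 3, 2, 1],   # illustrative residuals, not the book's figures
     [4, 2, 1, 0],
     [2, 1, 0, 0],
     [1, 0, 0, 0]]
Y = matmul(matmul(A, X), transpose(A))   # forward: Y = A X ATr
X2 = matmul(matmul(transpose(A), Y), A)  # inverse: X = ATr Y A
assert all(abs(X[i][j] - X2[i][j]) < 1e-9 for i in range(N) for j in range(N))
print(round(Y[0][0], 2))  # DC term at top left: 0.25 * sum of all samples = 5.5
```

The largest coefficient lands at the top left (the DC term), consistent with the energy compaction behavior described earlier.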

Figure 54 shows a set of 8x8 residual values that is taken from the top left, 8x8 block of a sample 16x16 block of residual values.

Image

Figure 54: Residual samples: top-left 8x8 block.

Performing the 8x8 DCT operation yields the 8x8 matrix shown in Figure 55.

Image

Figure 55: 8x8 DCT coefficients of the residual samples.

It should be noted from Figure 55 that the larger coefficients are compactly located around the top left corner, in other words, around the low frequency DC component. This is the desired energy compaction function of the transform operation.

H.265 and VP9 define prediction modes in accordance with the transform sizes and also use a combination of a few transforms to suit different prediction modes. VP9 supports transform sizes up to the prediction block size or 32x32, whichever is smaller. Figure 56 shows a screenshot from the stockholm clip with the grids showing the transform sizes. Larger transform sizes, up to 32x32, are used in smooth areas like the sky and water, while smaller transform sizes are better able to capture fine details like buildings and so on. Encoders, in addition to deciding the best prediction modes for every block, also have to decide the optimal transform size for every block.

Image

Figure 56: Flexible use of different transform sizes.

7.3 Quantization

Image

7.3.1 Basic Concepts

Quantization is a process of reducing the range of a set of numbers. The range is an important consideration as it determines the number of values and hence the number of bits needed to represent every value. With a reduced range, the numbers in the new number set can thus be represented using fewer bits. This also means that the granularity of values in the resulting number set is also reduced. In general, the higher the quantization that is applied, the coarser is the resulting data set. Quantization can be achieved by a simple process of division at the encoding side and a corresponding multiplication at the reception or decoding side. Let’s illustrate how this is done using an example.

Let’s consider a set of integers with a range of values from 0 to 255, as shown in the original 4x4 matrix in Figure 57a, below. When the numbers in this set are divided by a fixed value, say, 4, the resulting numbers will only have a range from 0 to 63, meaning only 64 values. Dividing by 4 and discarding the remainders, assuming this is an integer set, we get the following quantized matrix in Figure 57b. When we do the reverse operation of multiplication (by the same value 4) at the receiving end, we obtain the reconstructed 4x4 matrix as shown in Figure 57c below.
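The divide-by-4 example can be sketched in a few lines; the 4x4 values below are illustrative stand-ins, since Figure 57 carries the book's actual numbers in an image.

```python
# A sketch of the divide-by-4 example. The 4x4 values are illustrative
# stand-ins (Figure 57 carries the book's actual numbers).
original = [[20, 5, 3, 1],
            [ 7, 2, 1, 0],
            [ 3, 1, 0, 0],
            [ 1, 0, 0, 0]]
q = 4

# Quantize: divide and discard the remainder (integer division).
quantized = [[v // q for v in row] for row in original]
# Dequantize: multiply back at the receiving end.
reconstructed = [[v * q for v in row] for row in quantized]

print(quantized[0])      # [5, 1, 0, 0]: smaller range, zeros appear
print(reconstructed[0])  # [20, 4, 0, 0]: the discarded remainders are gone for good
```

Note that 5, 3 and 1 all come back as multiples of 4 (or as zero); the remainders discarded during the division are unrecoverable, which is exactly the lossy behavior recorded in the observations that follow.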

Image

Figure 57: Process of Quantization.

Let’s stop here for a moment to understand and record our observations from this process.

  1. The quantized numbers are smaller compared to the original numbers.
  2. Several numbers have become zeros.
  3. The quantizer value 4 controls the nature of the resulting number set. The higher this value, the more pronounced will be the above effects from observations 1 & 2.
  4. Information was lost during this process and the original numbers were non-retrievable.

These observations are important as they form the principles by which significant compression is achieved in every encoder. In the following section we shall explore in detail how this is done.

7.3.2 The Quantization Matrix

As we've seen earlier, it is easy to perform quantization using a single quantization value. It's also possible to take it a step further and use a set of QP values to do quantization, typically one value for each of the frequency components. What this provides is the ability to leverage the HVS, which is more sensitive to changes in low frequency components than in high frequency components. We can then optimize and customize the quantization process such that we employ higher quant values to discard higher frequency components, and prioritize and preserve low frequency components using lower quantization values. This set of quantization values is called a quantization matrix or quantization table. Furthermore, we can define different matrices for luma and chroma, with higher quant values for chroma compared to luma in accord with the HVS.
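A sketch of quantization with a matrix rather than a single value follows; both the coefficient block and the quantizer matrix here are illustrative, not the values shown in Figures 58 and 59.

```python
# A sketch of quantization with a matrix instead of a single value:
# small divisors preserve the low frequency (top left) terms while large
# divisors crush the high frequency ones. Both matrices are illustrative,
# not the values shown in Figures 58 and 59.
coeffs = [[52, 20,  8,  3],
          [18,  9,  4,  1],
          [ 7,  4,  2,  1],
          [ 3,  1,  1,  0]]
qmatrix = [[ 2,  4,  8, 16],
           [ 4,  8, 16, 32],
           [ 8, 16, 32, 64],
           [16, 32, 64, 64]]

quantized = [[c // q for c, q in zip(crow, qrow)]
             for crow, qrow in zip(coeffs, qmatrix)]
dequantized = [[v * q for v, q in zip(vrow, qrow)]
               for vrow, qrow in zip(quantized, qmatrix)]

print(quantized)        # only the top-left region survives
print(dequantized[0])   # [52, 20, 8, 0]: low frequencies reconstruct well
```

The small divisors at the top left keep the low frequency terms nearly intact on reconstruction, while the large divisors toward the bottom right zero out the high frequency terms entirely.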

Coming back to our earlier original number set, let us assume this set has low frequency components at the start and higher frequency components toward the end, as is typical in transformed blocks of video content.

Let us now define a quantizer set as shown in Figure 58 instead of a fixed value.

Image

Figure 58: Quantization matrix.

Dividing the original number set that is replicated in Figure 59a by this quantizer set and discarding the remainders assuming this is an integer set, we get the following quantized 4x4 matrix in 59b.

When we do the reverse operation of multiplication at the receiving end, we obtain the reconstructed matrix shown in Figure 59c. Clearly, the reconstructed values at the decoder end for the numbers at the top left of the 4x4 matrix, which correspond to the low frequency components, have improved compared to the results provided by division with a fixed quant value. While these numbers have become bigger with a wider range, this can be compensated for by more aggressive quantization values for the numbers down the series.

Image

Figure 59: Quantization using a quantization matrix.

7.3.3 Quantization in Video

Now that we have understood the mechanics of quantization, let us explore how this is applied specifically to video compression. In compression, quantization is the next step that a block of residual samples undergoes after the transform process. We discussed earlier that the block of transform coefficients has the values concentrated around the low frequency components. We’ve also discussed how, during quantization, signals with higher range are mapped to become signals with a reduced range that need fewer bits for representation. When applied to the residual samples of video, quantization thus serves to reduce the precision of the remaining non-zero coefficients. Furthermore, it usually leaves us with a block in which most or all coefficients are zero. In video coding standards, two quantization indicators are important:

  1. Quantization Parameter (QP): This is the application layer parameter that can be used to specify the level of quantization needed.
  2. Quantization Step Size (Qstep): This is the actual value by which the transformed values are quantized. Additional scaling may or may not be applied along with the Qstep division process.

QP and Qstep are usually mathematically related and one can be derived from the other. For example, in the H.264 standard, QP values range from 0 to 51 and QP and Qstep are mathematically linked by the equation: Qstep = 2^((QP-4)/6). In this scenario, every increase in QP value by 6 correspondingly doubles the value of Qstep.
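The quoted H.264 relation can be checked directly; this sketch evaluates Qstep = 2^((QP-4)/6) for a few QP values and confirms the doubling every 6 steps.

```python
# A quick check of the H.264 relation quoted above: Qstep = 2^((QP-4)/6),
# so every increase of QP by 6 doubles the step size.
def qstep(qp):
    return 2 ** ((qp - 4) / 6)

for qp in (4, 10, 16, 22, 28):
    print(qp, round(qstep(qp), 3))  # 1.0, 2.0, 4.0, 8.0, 16.0
assert abs(qstep(10) / qstep(4) - 2.0) < 1e-12
```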

It is then obvious that by setting the QP to a high value, the Qstep is correspondingly increased and more coefficients become zero, and vice versa. This means a higher QP results in fewer values to process and hence more compression at the expense of video quality. Conversely, setting the QP to a low value leaves us with more non-zero coefficients, resulting in higher quality but a lower compression ratio. Dialing the quantization parameter up or down helps in spreading the allocation of bits across areas of the video, both within the frame and within a time interval across several frames. The challenge for encoder implementations, then, is to arrive at suitable QP values that provide the highest compression efficiency while maintaining the best visual quality possible. The QP thus becomes the most critical parameter in tuning the picture quality. We will explore in detail how this is done in the section on rate control.

The quantizer is applied to the transformed coefficients (Tcoeff) as follows:

Qcoeff = round (Tcoeff/Qstep)

where Qstep is the quantizer step size.

As an example, let us look at a 16x16 block of residual samples that's transformed using a 16x16 integer transform and quantized as shown in Figures 60-62, below. In Figure 62, the smaller coefficients have become zero in the quantized block and the non-zero values are concentrated around the top-left coefficients that correspond to the low frequency components. Furthermore, the non-zero coefficients have a reduced range that now allows us to represent the quantized block values with fewer bits than the original signal.

Image

Figure 60: A 16x16 block of residual values after prediction.

It should be noted that quantization is an irreversible process, meaning there is no way to exactly reconstruct the signal input from the quantized values. To illustrate this concept, let us now perform the reverse operations just as the decoder would do when it receives the quantized signal in Figure 62.

Image

Figure 61: The 16x16 block after undergoing a 16x16 transform.

Image

Figure 62: The 16x16 block after undergoing quantization.

Figure 63 shows the 16x16 block after the process of inverse quantization. This is followed by Figure 64, which shows the final reconstructed residual values at the decoder after the inverse transform. While these values clearly show patterns and are a fair approximation of the original 16x16 residual block in Figure 60, they are nowhere near an identical representation of the input source.

The above operations were performed at a high QP, upwards of 40. This introduces significant quantization of the transformed values, as we observe in Figure 62. Let us now illustrate how this QP value affects the results above by performing the same operations at around QP 30 and also at around QP 20. This is shown in Figures 65 and 66, below.

Image

Figure 63: The 16x16 block after inverse quantization.

Image

Figure 64: The reconstructed 16x16 block after inverse 16x16 transform.

As we see in Figure 66, with lower QP values of around 20, the reconstructed residual values are almost identical to the input source residual values in Figure 60, except for some rounding differences.

These examples establish the following observations that are of paramount importance to a video engineer.

  1. The quantization process is irreversible, and the signal transmitted and received after quantization is an approximate, lossy version of the original source.
  2. Higher quantization results in greater loss of signal fidelity and higher compression.

Control of quantization values is key to striking a balance between preserving signal fidelity and achieving a high compression ratio.

Image

Figure 65: The reconstructed 16x16 block after inverse 16x16 transform in the QP 30 case.

Image

Figure 66: The reconstructed 16x16 block after inverse 16x16 transform in the QP 20 case.

7.3.4 How Are QP Values Assigned?

Quantization is as much an art as a science, as it involves analyzing the visual effects of discarding information. Significant research has been done on applying subjective quantization matrices to get the best visual experience at the highest compression efficiency. Novel ideas have been explored wherein a unique matrix or QP value can be signaled at every frame level, differently for luma and chroma, or even a different matrix can be signaled depending on the frame type in the GOP. With basic signaling at the frame level in place, the resulting QP values can also be adjusted within the picture frame depending on complexity, with flatter areas getting lower QP adjustments, thereby preserving details at the expense of complex areas. This is explained in the adaptive quantization section in chapter 10.

Figure 67 shows an example of a section of a video that's encoded with different quantization values. A lower QP value, say QP 30, as shown in the top left corner of the image, preserves the fidelity of the samples better and we can easily see the details of the tree leaves. As the QP increases, however, the encoded video becomes dramatically different from the source, resulting in blocky quantization artifacts, as observed in the bottom image that is encoded with QP 50. Modern encoders have built-in advanced spatial and temporal adaptive algorithms that analyze the scene content to calculate and optimally allocate the QP at a block level based on scene complexity. This helps to avoid unpleasant blocky effects in areas of detail and thereby provides significant visual quality benefits.

Image

Figure 67: Effects of quantization process.

source video: https://media.xiph.org/video/derf/

7.4 Reordering

As we have seen, the quantized transform coefficients have several zero coefficients and the non-zero coefficients are concentrated around the top left corner. Instead of transmitting all the values, which invariably include redundant zeros, it becomes beneficial to transmit the very few coefficient values and signal the remaining values as zeros. The decoder, when it receives the bitstream, would be able to derive the non-zero coefficients and then use the zero signaling to add zeros to the remaining values.

 4   2   2   1  -1   1   0   0
 2   2   1  -1   0   0   0   0
-2   1   1   1   0   0   0   0
 1   1   0   0   0   0   0   0
 1   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0
 0   0   0   0   0   0   0   0

Figure 68: 8x8 block of quantized coefficients.

In order to do this efficiently, it’s necessary that the non-zero coefficients be all packed together and the zero values be packed together separately. The best possible packing is then the one in which a maximum number of zeros, also called trailing zeros, are packed toward the end. This can then be signaled just once. As an example, let us consider the 8x8 transformed and quantized coefficients as shown in Figure 68. Parsing the above block horizontally in a simple raster order produces the following stream of values:

[4 2 2 1 -1 1 0 0 2 2 1 -1 0 0 0 0 -2 1 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 (and thirty-one 0s)]

By scanning the block horizontally or vertically, we see that many zeros are still present between non-zero coefficients, resulting in an inefficient representation. This is because the frequency organization of the 8x8 block is neither horizontal nor vertical but follows a different pattern. By using the pattern of frequency organization, we are able to organize the coefficients in order of increasing frequency. When this order is used, the non-zero coefficients that pertain to the low-frequency components are concentrated at the start, followed by the high-frequency components, which are usually zero after the quantization process.

The pattern of frequency organization is called zig-zag scan and is illustrated in Figure 69, below. When the previous residual block in Figure 68 is reordered using the zigzag scan pattern, starting at the top-left and traversing the block in a zigzag manner to the bottom right, it produces the following list of coefficients.

[4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1, (and forty-six 0s)]

[Image]

Figure 69: Zig-zag scanning order of coefficients of 8x8 block.

We notice that the number of trailing zeros is significantly higher than the raster scan pattern. The result is more efficient encoding. However, we also notice that there are still zeros between the coefficients that could potentially be further optimized.
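The reordering above can be sketched in a few lines of Python. This is an illustrative sketch of the classic zig-zag scan of Figure 69 (not VP9's actual scan tables); the `zigzag_order` helper is our own. Applied to the block of Figure 68, it reproduces the reordered list and the count of trailing zeros given above.

```python
# Illustrative zig-zag reordering of an 8x8 block (classic JPEG-style scan).

def zigzag_order(n=8):
    """Return the (row, col) visiting order for an n x n zig-zag scan."""
    order = []
    for s in range(2 * n - 1):                      # s indexes the anti-diagonals
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()                          # alternate traversal direction
        order.extend(diag)
    return order

# 8x8 block of quantized coefficients from Figure 68
block = [
    [ 4, 2, 2,  1, -1, 1, 0, 0],
    [ 2, 2, 1, -1,  0, 0, 0, 0],
    [-2, 1, 1,  1,  0, 0, 0, 0],
    [ 1, 1, 0,  0,  0, 0, 0, 0],
    [ 1, 0, 0,  0,  0, 0, 0, 0],
    [ 0, 0, 0,  0,  0, 0, 0, 0],
    [ 0, 0, 0,  0,  0, 0, 0, 0],
    [ 0, 0, 0,  0,  0, 0, 0, 0],
]

scanned = [block[r][c] for r, c in zigzag_order()]
last_nonzero = max(i for i, v in enumerate(scanned) if v != 0)
trailing = len(scanned) - last_nonzero - 1          # number of trailing zeros

print(scanned[:18])  # [4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1]
print(trailing)      # 46
```

Note how all 46 zeros after the last non-zero coefficient end up packed at the tail, which is exactly what makes the single end-of-block signal effective.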

Scan tables like the classic zig-zag scan shown in Figure 69 thus provide a much more efficient parsing of coefficients. All non-zero coefficients are grouped together first, followed by the zero coefficients. Other scan tables are also possible, and in fact VP9 provides a few different pattern options that organize coefficients roughly by their distance from the top-left corner. The default VP9 scan table, showing the order of the first few coefficients, is shown in Figure 70, below.

As we see below, the earlier 8x8 block of quantized coefficients can be also efficiently represented when the VP9 scanning pattern shown in Figure 70 is used for reordering.

[4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1, (and forty-six 0s)]

It should be noted that the scan pattern is very much tied to the underlying transform. It removes redundancies among residual samples and provides energy compaction. As VP9 provides flexible combinations of DCT and ADST for horizontal and vertical transforms, it also provides flexible scanning options which can be used in conjunction with the transform combinations.

[Image]

Figure 70: Default scanning order of coefficients of 8x8 block in VP9.

7.5 Run-Level Pair Encoding

Run level pair encoding is a technique to efficiently signal the large number of zero values in the quantized block. The basic concept is that, instead of encoding all the zeros that exist among the coefficients individually, which consumes bits, the number of zeros is signaled in the bitstream using a single value. Specifically, the number of leading zeros before any non-zero coefficient is signaled in the bitstream. This can save some bits, especially if there are several zeros between successive non-zero coefficients. Let us now explore how the earlier array can be broken down using this concept.

Sequence of numbers before run level encoding:

[4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1, (and forty-six 0s)]

Run level encoding strategy:

[(zero 0s followed by 4), (zero 0s followed by 2), (zero 0s followed by 2), (zero 0s followed by -2), (zero 0s followed by 2), (zero 0s followed by 2), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by 1), (zero 0s followed by -1), (zero 0s followed by -1), (zero 0s followed by 1), (one 0 followed by 1), forty-six 0s]

Sequence of numbers after run level encoding:

[(0,4)(0,2)(0,2)(0,-2)(0,2)(0,2)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,1)(0,-1)(0,-1)(0,1)(1,1)(End of block)]

The first number in every (run, level) pair, namely, run, indicates the number of immediately preceding zeros and the second number, namely, level, indicates the non-zero coefficient value. The end of block is a special signal that can be communicated in the bitstream to indicate no more bits and all else are zeros. Another way of organizing the same information could be assigning a value to every symbol that indicates the end of block. The previous sequence of numbers would then look as follows:

[(0,4,0)(0,2,0)(0,2,0)(0,-2,0)(0,2,0)(0,2,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,-1,0)(0,-1,0)(0,1,0)(1,1,1)]

Notice that the last value of the final triplet is set to 1, indicating the end of block. When the decoder receives this stream, it looks for the triplet whose last flag is set to 1 and sets all remaining values of the block to zero.
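The (run, level, last) scheme above can be sketched directly. This is an illustrative sketch, not any codec's exact syntax; the function names are ours.

```python
# Illustrative (run, level, last) triplet encoding of a scanned coefficient list.

def run_level_encode(coeffs):
    """Convert a scanned coefficient list into (run, level, last) triplets."""
    triplets = []
    run = 0
    for c in coeffs:
        if c == 0:
            run += 1                      # count zeros preceding the next level
        else:
            triplets.append([run, c, 0])  # 'run' zeros, then this non-zero level
            run = 0
    if triplets:
        triplets[-1][2] = 1               # mark the final triplet as end-of-block
    return [tuple(t) for t in triplets]

def run_level_decode(triplets, size=64):
    """Reverse the process: expand triplets back to a coefficient list."""
    coeffs = []
    for run, level, _last in triplets:
        coeffs.extend([0] * run)
        coeffs.append(level)
    coeffs.extend([0] * (size - len(coeffs)))  # zeros implied after end-of-block
    return coeffs

scanned = [4, 2, 2, -2, 2, 2, 1, 1, 1, 1, 1, 1, 1, -1, -1, 1, 0, 1] + [0] * 46
encoded = run_level_encode(scanned)
print(encoded[-2:])  # [(0, 1, 0), (1, 1, 1)]
```

Round-tripping through `run_level_decode` recovers the original 64 coefficients, confirming that the 46 trailing zeros never need to be signaled explicitly.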

In the next chapter we shall see how the compactly represented, quantized signal gets encoded as bits and bytes in the final encoded bitstream.

7.6 Summary

● Transforms take a block of residual pixel values (after prediction) and convert them to the frequency domain. This amounts to representing the same values in a different way.
● Pixel values vary in intensity, and frequency represents how quickly the intensity changes from one level to another and back. The faster the intensity changes from light to dark and back again, the higher the frequency needed to represent that part of the picture.
● A block of image pixels can be decomposed into a linear weighted sum of its constituent frequency components, with the higher-frequency components representing finer detail.
● Transforms provide energy compaction. This is the fundamental criterion for their selection.
● The DCT is widely used in video coding standards because it provides a high degree of energy compaction.
● Quantization is the process of reducing the range of the set of transformed residual values. It can be implemented with a division on the encoder side and a corresponding multiplication on the decoder side.
● Quantization is an irreversible process.
● Higher quantization leads to a loss of signal fidelity but higher compression. Controlling the quantization values is the key to balancing the preservation of signal fidelity against achieving high compression ratios.

8 Entropy Coding

[Image]

In the previous chapters we explored how inter and intra pixel redundancies are removed to minimize the information that needs to be encoded. We also saw mechanisms to efficiently represent the resulting residuals using transforms, quantization, scanning and run-level coding. The following pieces of information have to be sent as part of the bitstream at a block level: motion vectors, residuals, prediction modes, and filter settings information.

In this chapter, we will study in detail how the run-level values are encoded using the fewest bits by minimizing statistical coding redundancies. Recall that, in the previous section, we had the following run level pairs for the example encoding block:

[(0,4,0)(0,2,0)(0,2,0)(0,-2,0)(0,2,0)(0,2,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,1,0)(0,-1,0)(0,-1,0)(0,1,0)(1,1,1)]

The easiest way to encode this to a binary bitstream would be to understand the range of all possible values of such symbols, then determine the number of bits needed to encode them with a fixed number of bits per symbol. Assuming there are 100 such symbols each for last = 0 and last = 1, then we can create a unique value using 8 bits for every symbol. To encode the above set we would then need 17 x 8 = 136 bits or 17 bytes.

This implicitly assumes that all the symbols have the same likelihood of occurrence, in which case assigning the same number of bits to each symbol makes sense. However, in reality, data including video and image content rarely have symbols that are equally likely. Instead, they tend to have some symbols that occur more frequently than others.

8.1 Information Theory Concepts

Information theory helps us understand how best to send content with minimum bits for symbols with unequal likelihoods. This was presented first by Claude Shannon in his landmark paper, mentioned in Chapter 3. Shannon showed clearly the limits on how much data can be compressed in a lossless manner. In this section, we try to present intuitively how it works and also explain mathematically how the concepts are formulated.

Let’s start by illustrating with a simple example where English sentences are being communicated and need to be compressed. In this case, the symbols could be any of the letters of the English alphabet and the number of symbols thus is 26. Let’s say the word that needs to be sent is “seize.” When we receive the letter ‘s’, it’s a characteristic of the language that the next letter is more likely (higher probability) to be a vowel than say a letter ‘b’ (low probability). As we already know we’re more likely to receive a vowel next, receiving a vowel thus has less information to us than receiving a letter like ‘b’. Intuitively, we can thereby see the relationship between likelihood of occurrence (probability) and information. Information theory tells us the exact same principle expressed mathematically.

Let’s say we have the symbol set {x1, x2, x3 … xn}. Let P(xm) represent the probability of symbol xm occurring in the absence of other information. Then,

P(x1) +P(x2) +P(x3) ...+P(xn) = 1

INF (xm) = -log2 P(xm) bits

In other words,

INF (xm) = log2 [1 / P(xm)] bits

where INF (xm) is the information of xm in bits.            

Let us illustrate the above using a simple example. Let’s again come back to our earlier example of sending English text and the word "seize." Assuming for a moment the symbols are limited to the set {‘s’, ‘e’, ‘i’, ‘z’}, their probabilities of occurrence are as follows:

P(s) = ⅕, P(e) = ⅖, P(i) = ⅕, P(z) = ⅕

Therefore,

INF(s) = INF(i) = INF(z) = -log2(1/5) = 2.32

INF(e) = -log2(2/5) = 1.32

In this simple example, we see that receiving the letter ‘e’, because it has higher probability of occurrence, has less information than receiving the letters ‘s’ or ‘i’ or ‘z’.

The information of any symbol is inversely proportional to the likelihood of occurrence of the symbol. The higher the likelihood, the less information carried by the symbol, and hence the fewer bits should be needed for that symbol, and vice versa. This is the philosophy employed in entropy coding schemes such as variable length coding or the popular binary arithmetic coding used extensively since the advent of H.264.

We can also see this in our example sequence, where the symbol (0,1,0) occurs eight times and (0,2,0) occurs four times. In such a scenario, wherein the symbols are not equally likely, we now know it's best not to assign equal bits across all symbols but instead to strategically assign the bits such that the most frequent symbols get the fewest bits and the least frequent symbols get the most bits. This still ensures a unique symbol-to-bits mapping while optimizing the number of bits encoded in the bitstream.

8.1.1 The Concept of Entropy

The term entropy is widely used in information theory and can be intuitively thought of as a measure of randomness associated with a specific content. As we have seen earlier, the more random the content or its associated symbols, the more information it has and thereby more bits are needed for its transmission and vice versa.

Entropy thus is simply the average amount of information from the content and measures the average number of ‘bits’ needed to express the information.

Mathematically, entropy is expressed by the formula:

H(X) = −∑m P(xm) log2 P(xm)

H () represents entropy, which is the average number of bits.

X represents the set of all values in the content.

xm represents a symbol of the set X.

Coming back to our previous example, where we calculated the information for every symbol using the formula: INF (xm) = -log2 P(xm) wherein:

INF(s) = INF(i) = INF(z) = -log2(1/5) = 2.32

INF(e) = -log2 (2/5) = 1.32

The average information or entropy in this simple example would be a weighted average of the information of all the associated symbols. Mathematically,

Entropy H(X) = −∑m P(xm) log2 P(xm)

H (X) = ⅕ * 2.32 + ⅕ * 2.32 + ⅕ * 2.32 + ⅖ * 1.32 = 1.92
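The information and entropy numbers above are easy to verify numerically. The following sketch recomputes them for the "seize" symbol set:

```python
# Worked check of the per-symbol information and the entropy for "seize".
from math import log2

probs = {'s': 1/5, 'e': 2/5, 'i': 1/5, 'z': 1/5}

info = {sym: -log2(p) for sym, p in probs.items()}    # INF(x) = -log2 P(x)
entropy = sum(p * info[sym] for sym, p in probs.items())

print(round(info['s'], 2))   # 2.32
print(round(info['e'], 2))   # 1.32
print(round(entropy, 2))     # 1.92
```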

8.1.2 How Are Likelihoods or Probabilities Determined?

We now know that symbol likelihood-based entropy coding helps to optimize the number of bits sent in the bitstream. It’s also clear that entropy or the information in the data makes sense only when discussed in relation to their associated probability distribution. However, how are these probability distributions determined? In the above example, we computed the probabilities for one word with the symbol set, ‘s’, ‘e’, ‘i’, and ‘z’. If we had an extended library of all words with this symbol set, we could learn from it and improve the probabilities. The same principle can be extended to the symbols that are encoded in image and video, where in the probabilities are usually calculated by running several encoding simulations across a wide library of video content and counting the number of occurrences of every symbol. Such simulations are done at the time of standardization and the base probability tables for every symbol that is encoded are usually included in the normative part of the standard.

As content and complexity within the video sequence are constantly changing with every frame, so are the corresponding symbol likelihoods. How can the symbol likelihoods then be adaptively determined under such dynamic conditions? This is the challenge addressed by context adaptive entropy coding algorithms. These keep running counts of the recurrences of every symbol during the process of encoding. The algorithms use the counts to update the probabilities (called contexts), either every CTU or macroblock as in H.264 and H.265, or at the end of every frame as in VP9. As the statistics of symbols like motion vectors and residuals vary spatially and temporally, and also across bit rates, adapting the statistics based on already-coded symbols is recommended. The philosophy here is that, by constantly updating the probabilities as scenes, pictures, or settings change, a better allocation of bits to the symbols based on updated occurrences leads to better coding efficiency. H.264 defined two context adaptive entropy encoding schemes, namely, context adaptive variable length coding (CAVLC) and context adaptive binary arithmetic coding (CABAC). However, newer standards like H.265 and VP9 only support CABAC and have slightly different methods by which the contexts are maintained and updated. In keeping with the newer trends, and in an effort to keep the contents concise, this book will focus solely on CABAC.

8.2 Context Adaptive Binary Arithmetic Coding

As we’ve seen in the earlier section, there are two elements to modern entropy coding schemes, namely:

1. Coding algorithm (like arithmetic coding)

2. Context adaptivity

CABAC employs the above principles and has been found to achieve good compression performance through:

(a) selecting probability models for each syntax element according to the element’s context,

(b) adapting probability estimates based on local statistics, and

(c) using arithmetic coding. [1][2]

Coding a data symbol using CABAC involves the following three stages that will be explained in detail in this section:

  1. Binarization
  2. Context modeling
  3. Arithmetic coding

[Image]

Figure 71: Block diagram of context adaptive binary arithmetic coder.

Figure 71 illustrates the stages of this process. In the first step, if the input symbol is not binary-valued, it is mapped to a corresponding binary value in a process called binarization. The individual bits in the resulting binary value are called bins. Thus, instead of encoding the symbols themselves, we focus on encoding their mapped binary equivalents. In designing the binary value mapping for every symbol, care is taken to ensure that no binarized value is a prefix of another, so that every binary string received by the decoder can be uniquely decoded and mapped back to an encoded symbol. The next step is to select a suitable model based on the past symbol distribution, through a process called context modeling. The last step is the adaptive arithmetic encoding stage, which adapts itself using the probability estimates provided by the earlier stages. As the probability distribution of the input symbol is highly correlated with the probability distribution of the bins of its binary equivalent, the probability estimates of the neighboring symbols can be used to estimate fairly accurately the probabilities of the bins that will be encoded. After every bin is encoded, the probability estimates are immediately updated and used for encoding subsequent bins.
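A toy model can make the adaptive arithmetic-coding stage concrete. The sketch below is a floating-point simplification (real CABAC uses integer intervals, renormalization, and table-driven probability states): it narrows an interval for each bin, with the split point driven by adaptively updated counts. The final interval width determines the number of bits needed, roughly −log2(width), so skewed bin strings compress better than balanced ones.

```python
# Toy adaptive binary arithmetic coder (illustrative floating-point sketch only).
from math import log2

def encode_bins(bins):
    low, high = 0.0, 1.0
    c0, c1 = 1, 1                        # adaptive counts, starting from a uniform guess
    for b in bins:
        p0 = c0 / (c0 + c1)              # current estimate of P(bin == 0)
        split = low + (high - low) * p0
        if b == 0:
            high = split                 # take the lower sub-interval
            c0 += 1
        else:
            low = split                  # take the upper sub-interval
            c1 += 1
    return low, high                     # any number in [low, high) identifies the bins

lo1, hi1 = encode_bins([0] * 20)         # highly skewed bin string
lo2, hi2 = encode_bins([0, 1] * 10)      # balanced bin string
skewed_bits = -log2(hi1 - lo1)
balanced_bits = -log2(hi2 - lo2)
print(skewed_bits < balanced_bits)       # True: the skewed string needs fewer bits
```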

8.2.1 Binarization

Most of the encoded symbols, for example, residuals, prediction modes, and motion vectors, are non-binary valued. Binarization is the process of converting them to binary values before arithmetic coding. It is thus a pre-processing stage. It is carried out so that subsequently a simpler and uniform binary arithmetic coding scheme can be used, as opposed to an m-symbol arithmetic coding that is usually computationally more complex. It should be noted, however, that this binary code is further encoded by the arithmetic coder prior to transmission. The result of the binarization process is a binarized symbol string that consists of several bits. The subsequent stages of context modeling, arithmetic encoding and context updates are repeated for each bit of the binarized symbol string. The bits in the binarized string are also called bins. In H.265, the binarization schemes can be different for different symbols and can be of varying complexities. In this book, we shall illustrate a few binarization techniques that have been employed in H.265. These include Fixed Length binarization technique and a concatenated binarization technique that combines Truncated Unary and Exp-Golomb binarization. The same concepts can be extended to other schemes.

8.2.1.1 Fixed Length (FL) Binarization

This is a simple binarization scheme wherein the binarization string for the symbol corresponds to the actual binary representation of the symbol value.

Table 10: Binary codes for fixed length (FL) binarization.

x    BFL(x)
0    0 0 0
1    0 0 1
2    0 1 0
3    0 1 1
4    1 0 0
5    1 0 1
6    1 1 0
7    1 1 1

If x denotes a syntax element of a finite set such that 0 ≤ x ≤ S, we first determine the minimum number of bits needed to represent the range of values of the symbol set. This is given by:

lFL = ⌈log2(S+1)⌉

The FL binarization string of x is then simply given by the binary representation of x with lFL bits. Table 10 illustrates this with a simple example for a symbol that takes values from 0 to 7.

Here, S = 7 and hence lFL = log2(8) = 3 bits.

As this is a fixed length representation, it’s natural to apply it to symbol sets that have a uniform distribution.
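FL binarization is a one-liner in practice. The following sketch (the function name is ours) reproduces the codes of Table 10:

```python
# Fixed-length (FL) binarization sketch for a symbol set 0..S.
from math import ceil, log2

def fl_binarize(x, S):
    n_bits = ceil(log2(S + 1))           # lFL: bits needed to cover S+1 values
    return format(x, f'0{n_bits}b')      # x in binary, zero-padded to lFL bits

print(fl_binarize(0, 7))   # 000
print(fl_binarize(5, 7))   # 101
print(fl_binarize(7, 7))   # 111
```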

8.2.1.2 Concatenated Binarization

The next binarization technique we shall describe is a concatenation of two basic schemes, namely, truncated unary (TU) technique and exp-Golomb technique. The concatenated technique, called unary/kth order exp-Golomb (UEGk) binarizations, is applied to MVDs and transform coefficient levels in H.265. The unary code is the simplest prefix-free code to implement and easy for context adaptation. However, as larger symbol values don’t really benefit from context adaptation, the combination of a truncated unary technique as a prefix and a static exp-Golomb code as a suffix is used. In the following section, we describe each of these techniques and also how they are used in concatenation.

8.2.1.3 Truncated Unary (TU) Binarization

The TU coding scheme is similar to the unary coding scheme used to represent non-negative numbers. In this scheme, any symbol x belonging to a symbol set such that 0 ≤ x ≤ S is represented by x '1' bits and an extra '0' termination bit. The length of this transformed bin string is thus x + 1. In TU, for x = S, the termination bit is not used and only x '1' bits are sent, resulting in S bits for the maximum value.

Table 11 illustrates this with a simple example for a symbol which takes values from 0 to 8. Here, S = 8 and hence lTU = 8 bits.

Table 11: Binary codes for TU binarization.

x    BTU(x)
0    0
1    1 0
2    1 1 0
3    1 1 1 0
4    1 1 1 1 0
5    1 1 1 1 1 0
6    1 1 1 1 1 1 0
7    1 1 1 1 1 1 1 0
8    1 1 1 1 1 1 1 1
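The TU rule is equally simple to state in code. This sketch (our own helper name) reproduces the codes of Table 11 for S = 8:

```python
# Truncated unary (TU) binarization sketch with cut-off S.

def tu_binarize(x, S):
    if x == S:
        return '1' * S           # at the cut-off, the terminating '0' is dropped
    return '1' * x + '0'         # x ones followed by a terminating zero

print(tu_binarize(0, 8))   # 0
print(tu_binarize(3, 8))   # 1110
print(tu_binarize(8, 8))   # 11111111
```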

8.2.1.4 Exp-Golomb Binarization Technique

A k-th order exp-Golomb code (EGk) is derived by using a combination of unary code that is used as a prefix and padded with a suffix having the following variable length (ls):

ls = k + lp - 1

where lp is the length of the unary prefix code.

Table 12: Binary codes for 0th and 1st order exp-Golomb binarization code.

EG0 (k = 0):

x    lEG0    Unary prefix    Suffix
0    1       0               —
1    3       1 0             0
2    3       1 0             1
3    5       1 1 0           0 0
4    5       1 1 0           0 1
5    5       1 1 0           1 0
6    5       1 1 0           1 1
7    7       1 1 1 0         0 0 0
8    7       1 1 1 0         0 0 1
9    7       1 1 1 0         0 1 0

EG1 (k = 1):

x    lEG1    Unary prefix    Suffix
0    2       0               0
1    2       0               1
2    4       1 0             0 0
3    4       1 0             0 1
4    4       1 0             1 0
5    4       1 0             1 1
6    6       1 1 0           0 0 0
7    6       1 1 0           0 0 1
8    6       1 1 0           0 1 0
9    6       1 1 0           0 1 1
In general, k-th order EGk code has the same prefix across different k values but varies in suffix by a factor of k. Every k-th order EGk code scheme starts with k suffix bits and progresses from there. Examples for EGk codes for k=0 and k=1 are given in Table 12. EG0 code schemes start with 0 bits for suffix for their first value x=0 and then add 1 bit for suffix for x=1,2 and so on. In contrast, EG1 schemes start with 1-bit suffix code for x=0,1 and so on.
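The EGk construction can be sketched as follows. This is an illustrative helper of our own, following the convention in Table 12 of a unary prefix of '1's terminated by a '0', followed by a suffix whose length grows with each prefix bit (so that ls = k + lp − 1):

```python
# k-th order exp-Golomb (EGk) binarization sketch.

def egk_binarize(x, k):
    prefix = ''
    while x >= (1 << k):
        prefix += '1'            # each extra prefix '1' doubles the covered range
        x -= 1 << k
        k += 1
    suffix = format(x, f'0{k}b') if k > 0 else ''
    return prefix + '0' + suffix

print(egk_binarize(3, 0))   # 11000
print(egk_binarize(7, 0))   # 1110000
print(egk_binarize(6, 1))   # 110000
```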

Now that we know how TU and EGk codes work, let’s conclude this section with an example of a UEGk binarization code scheme that involves a simple concatenation of TU and EGk codes. As illustrated in Table 13 below, the scheme uses a TU prefix with a truncation cut-off value S = 14 and EGk suffix of order k=0.

Different schemes of this type are deployed in H.265 for different symbols like MVDs and transform coefficients. These schemes vary primarily in the cut-off points for the TU scheme and also the order k of the EGk suffix. These values are chosen after a careful consideration of the typical magnitudes of these symbols and their probability distributions.

Table 13: Binary codes for UEG0 binarization.

UEGk(x)

 x     BTU(x) (PREFIX)                EG0 (SUFFIX)

 0     0
 1     1 0
 2     1 1 0
 3     1 1 1 0
 4     1 1 1 1 0
 :     :
13     1 1 1 1 1 1 1 1 1 1 1 1 1 0
14     1 1 1 1 1 1 1 1 1 1 1 1 1 1   0
15     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 0 0
16     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 0 1
17     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 1 0 0 0
18     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 1 0 0 1
19     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 1 0 1 0
20     1 1 1 1 1 1 1 1 1 1 1 1 1 1   1 1 0 1 1
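The TU/EG0 concatenation of Table 13 can be sketched in a few lines of Python. This is an illustrative reimplementation (the function names are my own), following the construction described above:

```python
# Illustrative UEG0 codeword construction for Table 13: a truncated
# unary (TU) prefix with cut-off S = 14, concatenated with a 0th-order
# Exp-Golomb (EG0) suffix for values at or above the cut-off.
S = 14

def tu(x, s=S):
    # x ones followed by a terminating 0; at the cut-off the
    # terminator is dropped (that is what "truncated" means here).
    return "1" * x + ("0" if x < s else "")

def eg0(x):
    # CABAC-style EGk with k = 0: a run of 1s, a 0, then info bits.
    prefix, k = "", 0
    while x >= (1 << k):
        prefix += "1"
        x -= 1 << k
        k += 1
    info = format(x, "b").zfill(k) if k else ""
    return prefix + "0" + info

def ueg0(x):
    return tu(x) if x < S else tu(S) + eg0(x - S)

print(ueg0(2))   # 110
print(ueg0(16))  # fourteen 1s followed by the suffix 101
```

Note how values below the cut-off need no suffix at all, while the EG0 suffix grows only logarithmically for large values.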

The VP9 video standard employs a very similar design framework, with slight differences in implementation choices and terminology. Every non-binary symbol is binarized by constructing a binary tree, and each internal node of the tree corresponds to a bin. The tree is traversed, and the binary arithmetic coder is run at each node to encode a particular symbol. Every node (bin) has an associated probability with 8-bit precision, and this set of node probabilities is the maintained context for the symbol.
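As a rough sketch of this idea, consider a hypothetical three-symbol alphabet binarized with such a tree. The symbol names and the 8-bit node probabilities below are invented for illustration, not taken from the VP9 specification:

```python
# Hypothetical 3-symbol alphabet coded with a VP9-style binary tree.
# The root bin separates SMALL from {MEDIUM, LARGE}; a second bin
# separates MEDIUM from LARGE.
TREE = {                      # symbol -> list of (bin, node index)
    "SMALL":  [(0, 0)],
    "MEDIUM": [(1, 0), (0, 1)],
    "LARGE":  [(1, 0), (1, 1)],
}
NODE_PROBS = [200, 128]       # P(bin == 0) at each internal node, out of 256

def binarize(symbol):
    # Returns the bins on the path to the symbol, each paired with
    # the 8-bit probability context of the node it is coded at.
    return [(b, NODE_PROBS[node]) for b, node in TREE[symbol]]

print(binarize("MEDIUM"))     # [(1, 200), (0, 128)]
```

Each emitted bin is then fed to the binary arithmetic coder together with its node probability, and the set of node probabilities forms the symbol's context.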

Now that we’ve understood how the non-binary symbols are binarized to produce a binary bin-stream, let us delve into the details of the next step: context modeling.

8.2.2 Context Modeling

[Image]

Figure 72: Context modeling and arithmetic coding.

Each bin has an associated probability of being a ‘1’ or a ‘0’. This is determined by its context model. One model is chosen for every bin from many available models, based on the statistics of previously coded symbols. Figure 72 is extracted from the earlier Figure 71. It shows how context modeling is closely tied to the arithmetic encoding stage. The model tells the binary encoder the probability of each bin in the input bin string. If the model provides accurate probability estimates of the bins, they will be encoded optimally.

If, on the other hand, inaccuracies exist in the model, the result is a loss in coding efficiency. Thus, it’s clear that the efficiency of arithmetic encoding is closely tied to the probabilities of the symbols or bins provided by the modeling stage. The arithmetic coder design therefore provides a flexible context adaptation mechanism to update the probability distribution of the symbols or bins dynamically. This is further illustrated with an example in the section on arithmetic coding.

After encoding, the arithmetic coder analyzes the sequence of resulting coded bits and correspondingly updates the probability distribution (context model). These updated probabilities will be used by the model to provide input to encode the next input bin string. The updates to the model are critical in arithmetic coding but they come at the expense of higher complexity to both the encoder and the decoder. Different standards have chosen different strategies to strike a balance between the complexity and coding gains.

H.264 and H.265 use continual, per-symbol probability updates. This means that after each bit is encoded, the corresponding probability is updated. This is particularly useful in intra coding. However, these probabilities are not temporally carried forward but are reset every frame. This means that these codecs don't take advantage of coding redundancies between frames. VP9, on the other hand, does things differently by keeping the probabilities constant within a frame. This means that there is no per-symbol update after encoding every superblock. Instead, probabilities are updated after encoding every frame. This is done to keep the decoder implementation simpler. This mechanism of updating the contexts based on the symbol statistics of the previous frames, without any explicit signaling in the bitstream, is called backward adaptation. VP9 also provides a mechanism to explicitly signal the probabilities in the header of each frame before it’s encoded/decoded. This mechanism is called forward updates. By being able to adapt probabilities between frames, VP9 derives compression efficiency, especially during longer GOPs where several inter frames are used between key frames.

8.2.3 Arithmetic Coding

According to information theory, symbols with a very high probability of occurrence tend to have average information, or entropy, of less than 1 bit. A primary advantage of arithmetic coding over previous schemes like Huffman coding (which cannot encode a symbol with less than one bit) is that arithmetic coding allows for encoding symbols at a fractional number of bits. This leads to improved compression efficiency, especially for video content, as it tends to have symbols with high correlation and hence high probabilities. While a fractional number of bits per symbol can seem a vague idea at first, it’s again a matter of a transform, or change of perspective. Let’s explore the idea and illustrate in this section how a fractional number of bits per symbol is even possible.
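The entropy of a binary source can be computed directly. For the P1 = 0.3 source used in the examples that follow, it comes to roughly 0.88 bits per symbol, already below one bit:

```python
import math

def binary_entropy(p1):
    # Average information (bits per symbol) of a binary source
    # in which '1' occurs with probability p1.
    p0 = 1.0 - p1
    return -(p0 * math.log2(p0) + p1 * math.log2(p1))

print(round(binary_entropy(0.3), 3))  # 0.881 -- less than 1 bit per symbol
print(binary_entropy(0.5))            # 1.0   -- equiprobable: no gain possible
```

Only arithmetic coding can approach such sub-1-bit averages; any per-symbol prefix code must spend at least one whole bit per symbol.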

Note that the details of arithmetic coding are applicable for both binary and non-binary symbols. However, keeping in mind the importance of simplicity in implementation, video standards like H.265 and VP9 employ binary arithmetic coding and this will be the focus of this book.

8.2.3.1 Arithmetic Coding Basics

In this section, I will first explain the core arithmetic coding idea and how it’s able to achieve fractional bits per symbol. We will work through an example. The process of arithmetic coding involves what we can call a transform operation. The symbols to be encoded are transformed into coded number intervals, and successive bins are transformed and coded by recursive interval sub-division. The outcome of this process is that a sequence of bins can be represented by a single fractional number in a defined fractional interval, whose binary representation is then encoded in the bitstream.

[Image]

Figure 73: Process of binary arithmetic coding.

This means that the symbol bin stream is no longer encoded directly; rather, its mapped fractional representation is encoded instead. It is this mapping process that facilitates encoding with better compression efficiency, such that the final bitstream has a fractional number of bits per symbol compared to the input bin stream. This process relies critically on the input probability context model, and we will shortly show how this is so. The stages involved in the binary arithmetic coding process are illustrated in Figure 73.

Let us assume we have the following stream of 7 bits to be coded:

[0 1 0 0 0 0 1], with P0 = 0.7 and P1 = 0.3

To start the process, the available interval range is assumed to be [0, 1] and the goal is to identify a final interval specific to the sequence of symbols and to pick any number from that interval. To do this, the binary symbols are taken one by one and assigned sub-intervals based on their probabilities, as illustrated in Figure 74 for our example bin stream.

[Image]

Figure 74: Illustration of coding a sample sequence using arithmetic coding.

The first bit is ‘0’. It has P0 = 0.7 and is assigned the interval [0, 0.7]. This interval is chosen as the initial interval for encoding the next bit. In this case, the next bit is ‘1’ (with P1 = 0.3). Based on this, the interval [0, 0.7] is further sub-divided into [0.49, 0.7]. This then becomes the initial interval for the next bit, and so on.

The process can thus be summarized in the following 3 steps:

  1. Initialize the interval to [0,1].
  2. Calculate the sub-interval based on the incoming bit and its probability value.
  3. Use the sub-interval as the initial interval for the next bit and repeat step 2.

In our example, the final fractional interval is [0.5252947, 0.540421]. If we pick a number in this interval, say, 0.53125, its binary equivalent is exactly 0.10001. This can then be sent in the bitstream in the form of 5 bits: [10001]. It’s clear from this example how a series of 7 input binary symbols can be compactly represented using just 5 bits, thereby achieving a fractional number of bits per symbol: 0.714 bits per symbol in this case.
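The interval subdivision can be verified with a short sketch, assuming (as in Figure 74) that a ‘0’ takes the lower P0-fraction of the current interval and a ‘1’ takes the upper part:

```python
def interval_for(bits, p0):
    # Recursive interval subdivision: a '0' keeps the lower
    # p0-fraction of the current interval, a '1' keeps the rest.
    low, high = 0.0, 1.0
    for b in bits:
        split = low + p0 * (high - low)
        if b == 0:
            high = split
        else:
            low = split
    return low, high

low, high = interval_for([0, 1, 0, 0, 0, 0, 1], p0=0.7)
# low ~ 0.5252947 and high ~ 0.540421, matching the worked example;
# 0.53125 (binary 0.10001) lies inside, so the 5 bits [1 0 0 0 1] suffice.
```

The width of the final interval is the product of the probabilities of the coded bits, which is exactly why high-probability bits shrink the interval slowly and cost few bits.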

Now we'll explore how context probabilities affect this coding scheme, using the same sequence of input bits but different probabilities, say, P0 = 0.4 and P1 = 0.6. Figure 75 shows how this sequence will be coded. In this scenario, the final fractional interval is found to be [0.1624576, 0.166144]. If we pick a number in this interval, say, 0.1640625, its binary equivalent is exactly 0.0010101. Representing this in the bitstream requires a minimum of 7 bits. This is more than the 5 bits needed with probabilities P0 = 0.7 and P1 = 0.3, and it clearly demonstrates the critical importance of accurate probability context models in providing coding gains with an arithmetic coding scheme.

[Image]

Figure 75: Coding the sample sequence using different context probabilities.

However, arithmetic coding provides a flexible way to offset inaccurate probabilities by building in dynamic adaptation of probability estimates, based on the encoded symbols. In the above example with the initial P0 = 0.4 and P1 = 0.6, we can build in a simple mechanism to adapt the symbol probabilities at every stage, based on the bits encoded up to that stage.

[Image]

Figure 76: Illustration of dynamic probability adaptation in arithmetic coding.

Then we can update the intervals accordingly, overcoming the limitation of the inaccurate initial probability. We can then represent the same sequence of symbols using a number within the final interval [0.2181398528, 0.22239488], say, 0.21875, which is [00111] in binary, using only 5 bits. This is illustrated in Figure 76, above. Figure 76 also highlights how P0 is updated at every step to better reflect the symbol probabilities.
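A minimal sketch of this adaptation idea follows. The exact update rule behind Figure 76 is not spelled out here, so this uses a simple frequency-count rule (with pseudo-counts chosen so that the starting estimate is the inaccurate P0 = 0.4) purely for illustration. Because the decoder repeats the same updates, the two sides stay in sync without any probabilities being signaled:

```python
# Frequency-count probability adaptation (illustrative only; the
# update rule of the book's Figure 76 may differ). Pseudo-counts
# (2, 3) give the inaccurate starting estimate P0 = 0.4.
def adaptive_encode_interval(bits, c0=2, c1=3):
    low, high = 0.0, 1.0
    for b in bits:
        p0 = c0 / (c0 + c1)              # re-estimate before each bin
        split = low + p0 * (high - low)
        if b == 0:
            high, c0 = split, c0 + 1
        else:
            low, c1 = split, c1 + 1
    return low, high

def adaptive_decode(value, n, c0=2, c1=3):
    # The decoder applies the identical updates, so it stays in sync
    # with the encoder (backward adaptation, nothing signaled).
    low, high, bits = 0.0, 1.0, []
    for _ in range(n):
        p0 = c0 / (c0 + c1)
        split = low + p0 * (high - low)
        if value < split:
            bits.append(0)
            high, c0 = split, c0 + 1
        else:
            bits.append(1)
            low, c1 = split, c1 + 1
    return bits
```

Even with this crude rule, the final interval for the example stream [0 1 0 0 0 0 1] comes out wider than with the fixed, inaccurate P0 = 0.4, i.e. fewer bits are needed.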

8.2.3.2 Arithmetic Decoding

[Image]

Figure 77: Illustration of decoding an arithmetic coded bitstream.

Having explored how arithmetic encoding works, let us now explore its reverse operation, namely, arithmetic decoding. This is illustrated in Figure 77 and is the reverse of the earlier binary encoding steps. The interval range is initialized to [0,1] with known probabilities P0 = 0.7 and P1 = 0.3. These values are usually implicitly and dynamically computed by the decoder. When the decoder receives the binary sequence [10001], it interprets it as the fraction 0.10001, corresponding to decimal 0.53125. The interval [0,1] can then be successively sub-divided based on the context probabilities as follows. Since 0.53125 lies between 0 and 0.7, the first symbol is determined to be 0. The interval is then set to [0, 0.7], with sub-intervals [0, 0.49] and [0.49, 0.7] based on the P0 value. Since 0.53125 now lies between 0.49 and 0.7, which corresponds to the P1 = 0.3 probability interval, the symbol is determined to be 1. This step is repeated in succession to arrive at the sequence of symbol bits, namely, [0 1 0 0 0 0 1], as shown in Figure 77.
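These decoding steps can be sketched as follows, mirroring the encoder's subdivision rule:

```python
def arithmetic_decode(value, p0, n):
    # Mirrors the encoder: a value below the split decodes a '0'
    # (lower sub-interval), otherwise a '1' (upper sub-interval).
    low, high, bits = 0.0, 1.0, []
    for _ in range(n):
        split = low + p0 * (high - low)
        if value < split:
            bits.append(0)
            high = split
        else:
            bits.append(1)
            low = split
    return bits

# Binary 0.10001 is decimal 0.53125; with P0 = 0.7 it decodes back
# to the original seven bins.
print(arithmetic_decode(0.53125, 0.7, 7))  # [0, 1, 0, 0, 0, 0, 1]
```

Note that the decoder must also be told (or be able to infer) how many bins to extract, since any value in the final interval would keep decoding further bins indefinitely.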

The arithmetic coding process described in the previous section involves multiplication. This is avoided in CABAC implementations by approximating the interval values using lookup tables that are specified in the standard. This approach can potentially impact compression efficiency but is needed to keep the implementation simple. Also, during the internal encoding/decoding process, when the interval range drops below thresholds that are specified in the standard, a reset process is initiated. In this process, the bits from previous intervals are written to the bitstream and the process continues further.

8.3 Summary
  • Information theory helps us understand how best to send content with minimum bits for symbols having unequal likelihoods. Entropy coding is based on information theory.
  • The term entropy is widely used in information theory and can be intuitively thought of as a measure of randomness associated with a specific content. In other words, it is simply the average amount of information from the content and measures the average number of ‘bits’ needed to express the information.
  • Context adaptive entropy coding methods keep a running symbol count during encoding and use it to update probabilities (contexts) either at every block level or at the end of coding the frame. In doing so, the entropy coding context probability 'adapts' or changes dynamically as the content is encoded.
  • CABAC is a context adaptive entropy coding method which includes the following functions in sequence: binarization, context modeling and arithmetic coding.
  • Arithmetic coding involves a transformative operation where the symbols are transformed to coded number intervals and successive symbols are coded by recursive interval sub-division.
8.4 Notes
  1. Marpe D, Schwarz H, Wiegand T. Context-based adaptive binary arithmetic coding in the H.264/AVC video compression. Oral presentation by C-H Huang at: IEEE CSVT meeting; July 2003. https://slideplayer.com/slide/5674258/ Accessed September 22, 2018.
FILTERING

[Image]

A deblocking or in-loop filter is applied to the decoded (reconstructed) pixels to remove the artifacts around block boundaries, thereby improving visual quality and prediction performance. It does this using adaptive filter taps to smooth the block boundaries and remove edge artifacts that are formed as a result of block coding. Unlike older standards like MPEG2 and MPEG4 Part 2, the in-loop filtering in H.264, H.265 and VP9 is part of the normative process. This means that it is applied in the encoding and decoding pipeline after decoding the pixels and before these pixels can be used for further prediction.

9.1 Why Is In-Loop Filtering Needed?

As we have seen in previous chapters, quantization is a lossy process that results in reconstructed pixels. These are approximations of the original source pixels. In block-based processing architectures, the transform and subsequent quantization processes happen at the block level. These tend to introduce artificial blocking artifacts along the block edges. This can impair compression efficiency. As an example, assume we have the following two 4x4 neighboring blocks that are encoded:

[Image]

The process of quantization and inverse quantization results in the following blocks.

[Image]

While each of the 4x4 blocks that are separately transformed and quantized appears to be a reasonable approximation of the original input, the local averaging has resulted in a stark discontinuity along the 4x4 vertical edge that was not present in the source. This is quite common in block processing. Further processing is needed to mitigate the impact of this artificially created edge. The deblocking process works to identify such edges, analyze their intensities and then apply filtering across the identified block boundaries to smooth out these discontinuities. The block boundaries thus filtered usually include the edges between transform blocks and also the edges between blocks of different modes.

In HEVC, the in-loop filtering is pipelined internally into two stages, with a first-stage deblocking filter followed immediately by a sample adaptive offset (SAO) filter. The deblocking filter, similar to the one in H.264, is applied first. It operates on block boundaries to reduce the artifacts resulting from block-based transform and coding. Subsequently, the picture goes through the SAO filter. This filter does a sample-based smoothing by encompassing pixels that are not on block boundaries, in addition to those that are. These filters operate independently and can be activated and configured separately. In VP9, an enhanced in-loop filter with higher filter taps is used, and the SAO filter is not part of the standard. In the following sections, we shall explore both of these filters in detail.

9.2 Deblocking Filter

This is the first filter used in H.265 and the only filter used in H.264 and VP9. It applies a pixel-adaptive smoothing operation across block boundary edges. Filtering on picture boundaries is omitted. This filter provides the following two significant benefits.

  1. By selectively and adaptively smoothing out block edges, the decoded pictures look much better, thereby enhancing the overall visual experience. This is especially true at low bit rates, when block edge artifacts tend to become more visible due to higher quantization levels.
  2. As the filtering is applied after reconstructing the pixels but before these pixels can be used for prediction, the efficiency of the prediction is improved. This results in fewer and smaller residuals, hence better coding gains.

It should be noted that enabling the filter and determining the filter strengths are done adaptively, based on the characteristics of the neighboring blocks. These characteristics include the blocks' coding modes, the QP values used and the boundary pixel differences. This ensures that an optimal level of filtering, customized to the specific block boundary, is applied: neither too little nor too much. It also ensures that, while edge artifacts are removed, sharp edges and details inherent in the picture are carefully analyzed and retained.

In HEVC, the deblocking filter can be disabled or enabled by signaling in the stream-level header called the picture parameter set (PPS), and further parameters can be specified in the slice header. It is also disabled when a lossless coding mode is used, such as explicit PCM coding or in scenarios where transform and quantization are bypassed. In VP9, there is no sequence header signaling, and the deblocking parameters are included as part of the picture header. All these standards perform deblocking in two stages. First, the vertical edges are filtered, and then horizontal edges are processed on the vertically processed pixels. As an example, Figure 78 illustrates the order in which the internal edges in a 64x64 superblock are filtered in VP9. The loop filter operates on a raster scan order of superblocks. For each superblock, it is first applied to vertical boundaries as shown in the thick lines. It is then applied to horizontal boundaries as shown by the dotted lines. The numbers in Figure 78 also indicate the order in which the filtering is applied.

VP9 provides up to four deblocking filter options. These operate on up to 7 pixels on either side of the edges. HEVC provides two filters that modify up to 3 pixels on either side of the block boundary. While such details of filtering can vary across standards, conceptually they are identical and follow similar processing techniques that will be explained here.

[Image]

Figure 78: Order of processing of deblocking for 4x4 blocks of a superblock in VP9.

9.2.1 The Deblocking Process

The complete deblocking process involves the following three stages. The first two stages are analysis and decision steps. The last stage is the actual filtering step.

9.2.1.1 Identifying the Edges to Be Deblocked

This is the first processing stage, wherein the edges within the block are identified and marked for deblocking. Not all edges will be deblocked, and this stage helps in marking which ones will be. If the standard prescribes that only edges of certain sizes will be deblocked, then the other edges are marked for non-filtering. In H.265, deblocking is done only on 8x8 block boundaries, whereas in VP9 it is done along 4x4, 8x8, 16x16 and 32x32 transform block boundaries. Similarly, under certain conditions, samples that are on picture boundaries or tile boundaries may be excluded from deblocking.

9.2.1.2 Deriving the Filter Strengths

Once the edges that need filtering are identified, the next step is to determine how much filtering is required for each edge by using the boundary strength parameter. This parameter is usually determined by inspecting a set of pixels on each side of the edge and also inspecting the characteristics of the edge, such as the prediction modes, motion vectors and transform coefficients used by the neighboring blocks.

The first idea here is to identify boundary areas with high probability of blocking distortion such as the boundary between intra coded blocks or blocks with coded transform coefficients. In such areas, stronger filtering will need to be applied.

Additionally, real edges around objects in the source also have significant pixel variance across the edge boundaries and will need to be preserved and not filtered. This is gauged by looking at the boundary pixel variance in conjunction with the corresponding block QP values. When the QP value is small, it is likely that boundary distortions due to block processing are low, and therefore any significant gradient across the boundary is likely an edge in the source that needs to be preserved. Conversely, when the QP is large, there is a strong possibility of blocking distortion, which warrants stronger filtering.

9.2.1.3 Applying the Filtering

The final step is to actually apply the filtering along all horizontal and vertical edges that have been identified for filtering in the earlier steps. A corresponding calculation in this step is to decide how many pixels around the edges will be deblocked and adjusted. This is done based on the filter strength decision from stage 2. If it’s determined that a strong filtering is needed, more pixels are affected and vice versa. H.265 filters up to 3 pixels on either side of the edge for luma and one pixel on either side for chroma. VP9 has extended modes that filter up to 7 pixels per side for both luma and chroma.
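A toy version of such a boundary filter is sketched below. The taps and the threshold are illustrative only, not the normative H.265 or VP9 filters: a large step across the edge is treated as a real edge and preserved, while a small step is smoothed:

```python
# Toy deblocking of one row of pixels [p1, p0 | q0, q1] straddling a
# vertical block edge. Taps and threshold are illustrative, not the
# normative H.265/VP9 filters.
def filter_row(p1, p0, q0, q1, threshold=32):
    if abs(p0 - q0) >= threshold:
        return p0, q0            # large step: likely a real edge, keep it
    # weak low-pass across the boundary (integer arithmetic with rounding)
    new_p0 = (p1 + 2 * p0 + q0 + 2) >> 2
    new_q0 = (p0 + 2 * q0 + q1 + 2) >> 2
    return new_p0, new_q0

print(filter_row(100, 100, 60, 60))  # (100, 60): step of 40 is preserved
print(filter_row(100, 100, 80, 80))  # (95, 85): step of 20 smoothed to 10
```

A stronger filter would simply involve more neighbors (p2, q2, ...) on each side, which is exactly the difference between the weak and extended modes mentioned above.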

9.2.2 Filtering Example

Figures 79 and 80 illustrate the visual effects of deblocking through an example. They show frame number 260 of the akiyo newsreader clip at CIF resolution. The clip is encoded using the x265 H.265 encoder at a low bit rate of 100 kbps, which results in a high QP. Figure 79 shows the encoded clip with the deblocking filter enabled.

[Image]

Figure 79: akiyo clip encoded at 100 kbps with deblocking.

[Image]

Figure 80: akiyo clip encoded at 100 kbps with deblocking disabled.

source: https://media.xiph.org/video/derf/

Figure 80 illustrates the output with the filter disabled. While the deblocking filter does not necessarily soften all the blocking artifacts, it works to significantly improve the visual experience. This can be seen especially around the eyes and chin areas of the face and in the jaggies along the folds of the clothes. It should be noted that at low bit rates, care should be taken to balance the filtering process against the preservation of real edges in the content.

9.3 SAO

The sample adaptive offset (SAO) filter is the second stage filter. It is used in-loop exclusively in H.265 after the deblocking filter is applied. This is illustrated in Figure 81, below. While the deblocking filter primarily operates on transform block edges to fix blocking artifacts, the SAO filter is used to remove ringing artifacts and to reduce the mean distortion between reconstructed and original pictures [1]. Thus, the two filters work in a complementary fashion to provide good cumulative visual benefit.

In signal processing, low-pass filtering operations that involve truncating in the frequency domain cause ringing artifacts in the time domain. In modern codecs, the loss of high-frequency components resulting from finite block-based transforms results in ringing. As high frequency corresponds to sharp transitions, the ringing is particularly observable around edges. Another source of ringing artifact is the use of interpolation filters with a large number of taps that are used in H.265 and VP9.

[Image]

Figure 81: Video decoding pipeline with in-loop deblocking and SAO filters.

The SAO filter provides the fix by modifying the reconstructed pixels. It first divides the region into multiple categories. For each category, an offset is computed. The filter then conditionally adds the offset to each pixel within every category. It should be noted that the filter may use a different offset for every sample in a region, depending on the sample classification. Also, the filter parameters can vary across regions. By doing this, it reduces the mean sample distortion of the identified region. The SAO filter offset values can also be generated using criteria other than minimization of the regional mean sample distortion. Two SAO modes are specified in H.265, namely, edge offset mode and band offset mode. While the edge offset mode depends on directional information based on the current pixels and the neighboring pixels, the band offset mode operates without any dependency on the neighboring samples. Let us explore the concepts behind each of these two approaches.

9.3.1 Edge Offset Mode

Figure 82: Four 1D patterns for edge offset SAO Filter in HEVC.

In H.265, the edge offset mode uses 4 directional modes (using neighboring pixels) as shown in Figure 82 (Fu, et al., 2012) [1]. One of these directions is chosen for every CTU by the encoder using rate distortion optimization.

Figure 83: Pixel categorization to identify local valley, peak, concave or convex corners [1].

For every mode, the pixels in the CTU are analyzed to see if they belong to one of the four categories, namely, 1) local valley, 2) concave corner, 3) local peak or 4) convex corner. This is shown in Figure 83. As categories (1) and (2) place the current pixel at a local minimum relative to its neighboring pixels, positive offsets for these categories would smooth out the local minima, while negative offsets would in turn sharpen the minima. The effects are reversed for categories (3) and (4), where negative offsets result in smoothing and positive offsets result in sharpening.

9.3.2 Band Offset Mode

In band offset (BO) mode, the sample value range is equally divided into 32 bands (for 8-bit pixels), each covering eight values. For every band, the BO, which is the average difference between the original and reconstructed samples, is calculated and sent in the bitstream. To reduce complexity and signaling, in HEVC, not all band offsets are signaled. Instead, only four offsets are signaled, corresponding in this example to bands with pixel values between 72 and 104. If a sample belongs to any other band, BO is not applied.
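A minimal sketch of band classification and offset application for one 8-bit sample (the function name and clipping behavior are illustrative assumptions, not lifted from the HEVC reference code):

```python
def apply_band_offset(pixel: int, start_band: int, offsets: list) -> int:
    """Apply SAO band offset to one 8-bit sample.

    The 0..255 range splits into 32 bands of 8 values each, so the band
    index is simply pixel >> 3.  Only the four consecutive bands starting
    at start_band carry signaled offsets; all other samples pass through.
    """
    band = pixel >> 3  # which of the 32 bands this sample falls into
    if start_band <= band < start_band + 4:
        pixel += offsets[band - start_band]
    return max(0, min(255, pixel))  # clip back to the valid sample range

# Bands 9..12 cover the pixel values 72..103 from the example above.
```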

Figure 84: Illustration of BO in HEVC, where the dotted curve is the original samples and the solid curve is the reconstructed samples.

In Figure 84 [1], the horizontal axis denotes the sample position and the vertical axis denotes the sample value. The dotted curve is the original samples. The solid curve is the reconstructed samples. As we see in this example, for these four bands, the reconstructed samples are shifted slightly to the left of the original samples. This results in negative errors that can be corrected by signaling positive BOs for these four bands.

9.3.3 SAO Implementation

The implementation of SAO can be done using rate-distortion optimization techniques. A range of offset values is carefully selected and added to the pre-SAO pixels. The distortion between the source samples and post-SAO reconstructed samples is then calculated. This is done for all selected offsets across all bands or EO classes. The best offset and edge offset class for EO mode and the best offset for each band, in case of BO, are then obtained by choosing the ones that minimize the RD cost. The delta cost for the entire CTU can then be calculated by computing the difference between the best post-SAO cost and the pre-SAO cost. If this delta cost is negative, then the SAO filter is enabled for the CTU.
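The offset search can be sketched as follows. This toy version uses a sum of squared errors for the distortion and a fixed, assumed rate cost per offset, so it is far simpler than a real RD search, but it shows how a negative delta cost leads to SAO being enabled:

```python
def best_sao_offset(orig, recon, lam, bits_per_offset=2, search=range(-7, 8)):
    """Pick one category's SAO offset by minimizing J = D + lambda * R.

    orig/recon hold the original and pre-SAO reconstructed values of the
    pixels in this category.  Distortion is the sum of squared errors and
    the rate is a fixed, assumed bits_per_offset for any offset value.
    Returns the chosen offset and the delta cost versus offset zero;
    a negative delta means enabling SAO helps.
    """
    def cost(off):
        d = sum((o - (r + off)) ** 2 for o, r in zip(orig, recon))
        return d + lam * bits_per_offset
    best = min(search, key=cost)
    return best, cost(best) - cost(0)
```

For pixels reconstructed two levels too dark, the search picks offset +2 and reports a negative delta cost, so the filter would be enabled for this category.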

9.4 Summary
● The deblocking filter is applied to reconstructed pixels to remove blocking artifacts around block edge boundaries, providing the dual benefits of improved visual quality and better prediction performance.
● In HEVC, in-loop filtering is internally pipelined into two stages, with the first-stage deblocking filter followed by the SAO filter.
● Deblocking filter implementations differ in the number of pixels they operate on and modify. VP9 provides up to four deblocking filter options that operate on up to 7 pixels on either side of an edge. HEVC provides two filters that can modify up to 3 pixels on either side of a block boundary.
● Two SAO modes are specified in H.265, namely, edge offset (EO) mode and band offset (BO) mode.
● The EO mode depends on directional information derived from the current pixel and its neighboring pixels. The BO mode operates without any dependency on the neighboring samples.
9.5 Notes
  1. Fu C, Alshina E, Alshin A, et al. Sample adaptive offset in the HEVC standard. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1755-1764. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.352.2725&rep=rep1&type=pdf. Accessed September 22, 2018.

 

 

10 Mode Decision and Rate Control

图像

 

 

Bitrate is the amount of data in bits that is used to encode one second of video. It is usually expressed in megabits per second (Mbps) or kilobits per second (kbps). Bitrate is a critical parameter of the video and affects the file size and overall quality of the video. In general, the higher the bitrate, the more bits there are available to encode the video. This means better video quality but usually comes at the expense of bigger file size. When the application provides a target bitrate to the encoder, the job of the encoder is to allocate the available bits intelligently across the video sequence, keeping track of the video complexity and delivering the best possible picture quality.
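The bits-per-second relationship above implies a simple size calculation. A hypothetical helper (the name is mine, not from the text) converts a bitrate and duration into an approximate file size:

```python
def file_size_mb(bitrate_mbps: float, seconds: float) -> float:
    """Approximate file size implied by an average bitrate:
    size = bitrate x duration, converted from megabits to megabytes."""
    return bitrate_mbps * seconds / 8.0

# A 60-second clip at an average of 5 Mbps comes to about 37.5 MB.
```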

Internally, the video encoding process is highly complex because it involves a combination of computationally intensive mathematical operations together with the need to make complex decisions at various stages. As we know, video encoding is a lossy process and the loss is introduced primarily by the quantization process. At the outset, how much loss is acceptable, or, in other words, how much compression is needed, is determined by the overall encoding settings and parameters. The settings include the channel capacity available and the encoding mode required. The parameters include the output bitrate and latency settings. These are the constraints under which the encoder operates. However, the actual encoding process is much more complex, in that the encoder must make decisions at all levels, starting from the GOP all the way down to superblocks and sub-blocks. By understanding picture complexity, the encoder has to determine the optimal picture types, motion vectors and prediction modes for every block in every picture in the video sequence. This complex process is called mode decision. Every block then also has to be encoded using an optimal number of bits. The process of assigning bits across different parts of the sequence is called rate control.

In the first section of this chapter, we will deal with the process of mode decision. The latter half of the chapter will cover topics in rate control. We will also see in this chapter how these processes are intertwined with one another.

10.1 Constraints

Given specific settings, including bitrate, latency and so on, the fundamental challenge for any encoder is how to optimize the output-encoded picture quality such that it can either:

A) maximize the output video quality for a given bitrate constraint, or
B) minimize the bitrate for a given output video quality.

While doing the above, the encoder must also ensure that it operates within its several constraints. Some of these are outlined below:

Bitrate. The encoder has to ensure it produces an average bitrate per this setting. Additional constraints may also be imposed, such that it may also be required to operate within a set of maximum and minimum bitrate limits. This is especially the case in constant bitrate mode where, usually, the channel capacity is fixed.

Latency. Latency is defined as the total time consumed between the picture being input to the encoder and being output from the decoder and available for display. This interval depends on factors like the number of encoding pipeline stages, the number of buffers at various stages in the encoder pipeline and how the encoder processes various picture types, such as B-pictures, with its internal buffering mechanisms. The interval also includes the corresponding operations from the decoder. The term, latency, usually refers to the combined latency of both the encoder and the decoder.

Buffer Space. When the decoder receives the encoded bitstream, it stores it in a buffer. There, the decoder smooths out the variations in the bitrate so as to provide decoded output at a constant time interval. Conversely, this buffer also defines the flexibility that the encoder has, in terms of the variability of its bitrate, both instantaneously and at any defined interval of time. The buffer fullness at any time is thus a difference between the bits encoded and a constant rate of removal from the buffer that corresponds to the target bitrate. The lower boundary of the buffer is zero and the upper boundary is the buffer capacity. H.264 and H.265 define a hypothetical reference decoder (HRD) buffer model. This model is used to simulate the fullness of the decoder buffer. This aids the rate control in producing a compliant bitstream.

The encoder thus has to tightly regulate the number of bits sent in any period of time. This is to ensure that the decoder buffers are never full or empty. This is especially true for hardware decoders that often have limited memory buffers. When the decoder buffer is full, no further bits can be accommodated, and incoming bits may be dropped. On the other hand, if the decoder has consumed all the bits and the buffer becomes empty, it may not have anything to display except the last decoded picture. This may manifest as undesirable pauses in the output video.
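The buffer constraint can be illustrated with a toy leaky-bucket model. This is an encoder-side simplification with invented names, not the normative HRD equations: bits enter per encoded frame and drain at the constant channel rate.

```python
def simulate_decoder_buffer(frame_bits, capacity, drain_per_frame):
    """Toy leaky-bucket view of buffer fullness.

    Each encoded frame's bits enter the bucket and a constant
    drain_per_frame (target bitrate / frame rate) leaves it.  Returns the
    per-frame fullness plus overflow/underflow flags that rate control
    must keep clear.
    """
    fullness, trace, overflow, underflow = 0, [], False, False
    for bits in frame_bits:
        fullness += bits
        if fullness > capacity:
            overflow = True      # more bits arrived than the buffer holds
        fullness -= drain_per_frame
        if fullness < 0:
            underflow = True     # the drain ran dry: playback would stall
            fullness = 0
        trace.append(fullness)
    return trace, overflow, underflow
```

Frames coded exactly at the drain rate keep the bucket level; a burst of large frames trips the overflow flag, and a starved stream trips the underflow flag, which manifests as the pauses described above.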

Encoding Speed. Typically, encoding applications are classified as either real-time or non-real-time and most encoders are designed for one or the other. Examples of real-time encoding applications include live event broadcasting. Here, the camera feed reaches the studios, where it's processed, encoded and streamed over satellite, cable or the internet in real time. In real-time encoding, if the output frame rate is 60fps, then the encoder has to ensure it can produce an encoded output of 60 frames in every second of its operation. Non-real-time encoding, or offline encoding, has the luxury of time to perform additional processing in an effort to improve the encoding quality. A typical example is video-on-demand streaming, where all the content is encoded offline and stored in servers, and the requested video is fetched, streamed and played back on demand.

Operating within these constraints, the encoder has to make decisions at all stages, including selecting picture types, selection of partition types for the coding blocks, selection of prediction modes, motion vectors and the corresponding reference pictures they point to, filtering modes, transform sizes and modes, quantization parameters, and so on. By taking these decisions at various stages, the encoder strives to optimize the bit spend both within and across various pictures to provide the best output picture quality. This video quality is measured objectively by comparing the reconstructed video (encoded and decoded) to the original input video using a mathematical formula called a distortion measure, usually computed pixel-by-pixel and averaged over the frame. As the distortion measure indicates how different the reconstructed frame or block is from the original, the higher this number, the worse the quality associated with the selected block, and vice versa.

10.2 Distortion Measures

In this section, we will explore two widely used distortion measures, namely, the sum of absolute differences (SAD) and sum of absolute transform differences (SATD). In these methods, a pixel-by-pixel operation is performed and all pixels in both the original and the reconstructed picture are used. As these measures are used for decisions at every block level, the computations are usually performed in the encoder at a block level.

10.2.1 Sum of Absolute Differences

To arrive at this metric, the absolute differences between each pixel in the original picture block and the corresponding pixels in the reconstructed block are calculated. The sum of these differences is the SAD for the block. In signal processing, this corresponds to the L1 norm and is expressed mathematically by the following equation:

SAD = Σn Σm | cn,m - rn,m |

where cn,m corresponds to the current picture block samples and rn,m corresponds to the reconstructed block samples.

The SAD can be used in motion estimation to compare the similarities of the current block of pixels to the block being pointed to by the motion vector in the reconstructed picture set used for prediction.
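The SAD equation above translates directly into code. This helper is a hypothetical sketch (not taken from any codec) operating on a block given as rows of pixels:

```python
def sad(cur, ref):
    """Sum of absolute differences (the L1 norm of the residual) between
    a current block and a reference block, both given as rows of pixels."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))
```

During motion estimation, the candidate block whose SAD against the current block is lowest would be the preferred match.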

10.2.2 Sum of Absolute Transform Differences (SATD)

This is a metric that has been used since H.264 where Hadamard transforms were introduced for transform of residual samples. The SATD is similar to the SAD except that it does an additional step of computing the Hadamard transform of the residual samples and then calculates the sum of absolute values of the Hadamard-transformed residuals. By incorporating the integer Hadamard transform in the cost computations, this metric is better able to portray the actual cost of coding the resulting residual values when Hadamard transforms are used. In such scenarios, this measure results in better decisions in the motion estimation process.

                  T = H . (C - R) . HT

SATD = Σn Σm | tn,m |

where C is the matrix corresponding to the current picture block samples and R is the matrix representation of the reconstructed block samples. H is the Hadamard transform matrix and T is thus the result of Hadamard transform of the residual samples whose sum of absolute values results in the SATD metric.
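A worked sketch of SATD on a 2x2 block using the 2x2 Hadamard matrix; real encoders use 4x4 or 8x8 transforms and fast butterfly implementations, so this is purely illustrative:

```python
H2 = [[1, 1], [1, -1]]  # 2x2 Hadamard matrix (symmetric, so H2 equals its transpose)

def matmul(a, b):
    """Plain matrix multiply for small integer matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def satd(cur, ref):
    """SATD of a 2x2 block: form the residual C - R, transform it as
    T = H . (C - R) . H^T, then sum the absolute transformed values."""
    resid = [[c - r for c, r in zip(rc, rr)] for rc, rr in zip(cur, ref)]
    t = matmul(matmul(H2, resid), H2)
    return sum(abs(v) for row in t for v in row)
```

For a flat residual of all ones, the transform concentrates the energy into a single DC coefficient of 4, which is why SATD tracks the coding cost of transformed residuals better than plain SAD.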

10.3 Formulation of the Encoding Problem

Thus, the fundamental encoding problem is an optimization problem that can be stated as minimization of the distortion between the input video and its output reconstructed video, subject to a set of constraints including bitrates and coding delay, as we have seen already.

Given the large number of encoding parameters, the above minimization problem is broken down into smaller minimization problems.

As mentioned earlier, the encoder has to decide, for each superblock or CTU, what block partitioning to use, what coding mode to use and what prediction parameters to choose. This has to be done for every block in every picture in the video sequence by keeping in mind at every step, the output bit rate and distortion produced as a result of the previous selections. Thus, the task for the encoder is to find the best coding modes for every picture, such that the selected distortion measure (D) is minimized while always subject to the rate constraint (Rc).

Mathematically, this can be expressed as:

min D with R < Rc

Let us assume we break down the video into n blocks. Let P represent the set of the n coding decisions to be made on these n blocks. These coding decisions directly affect the distortion (D) and the rate of bits produced (R). Hence, we can say D and R are functions of P. The constrained minimization problem can thus be represented as:

min D (P) with R (P) < Rc

One way to solve this problem is to convert this constrained minimization problem into an unconstrained minimization problem using the method of Lagrange multipliers. In this method, the constraint function is appended to the function to be minimized by multiplying it with a scalar called Lagrange multiplier. This becomes a Lagrangian function (say J) and the solution to the original constrained problem is obtained by solving for both an optimal Pn and an optimal set of Lagrange multipliers (say λ). This is mathematically represented as:

J (P, λ) = D (P) + λ ⋅ R (P)

Jopt = min J (P, λ)

To simplify the implementation, it’s assumed that the coding decisions of the n blocks are independent of each other and thus the joint cost J can be derived as the sum of joint costs of the n blocks.

J (P, λ) = Σn Jn (Pn, λ)

The optimal solution of this optimization problem J (P, λ) can be obtained by independently selecting the coding modes Pn for the n blocks. Under this simplification, the combined optimal cost Jopt is the sum of optimal costs of the n blocks.

Jopt = Σn min Jn (Pn, λ)

The encoder has to estimate the number of bits that will be spent if a particular mode were selected to encode the block and then compute the cost of using that mode. By iterating across all the modes and computing the corresponding bit costs, the encoder can analyze and select the mode best able to minimize the cost of encoding. The above optimization technique has been widely deployed, thanks to its effectiveness and simplicity. Using this approach, the computational time spent evaluating all the modes and testing their costs for minimality has a significant performance impact. The decisions made in one block tend to have cascading effects on decisions made in other blocks, both spatially and temporally, and the design of video encoders involve simplifications which usually ignore these effects.
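The per-block minimization can be sketched as follows, with made-up (distortion, bits) estimates for a few candidate modes; the numbers are invented for illustration only:

```python
def choose_mode(modes, lam):
    """Lagrangian mode decision: pick the candidate minimizing J = D + lambda * R.

    modes maps a mode name to an estimated (distortion, bits) pair.
    """
    return min(modes, key=lambda m: modes[m][0] + lam * modes[m][1])

# Hypothetical estimates for one block: skip is cheap but inaccurate,
# intra is accurate but expensive in bits.
candidates = {"skip": (900, 2), "inter_16x16": (400, 30), "intra": (380, 90)}
```

With a small λ the costly but accurate intra mode wins; as λ grows, cheaper modes take over, mirroring how a higher QP shifts decisions toward skip.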

10.4 Rate-Distortion Optimization

The overall bitrate of the encoded stream depends on the modes selected by the encoder for the blocks, the motion vectors and the bits used to encode the transformed residuals after quantization. The quantized coefficients typically account for 60-70% or more of the bits in the bitstream and hence are the main focus of rate control. By increasing the Qp, the bitrate (R) is reduced, and vice versa. However, this comes at the expense of the video quality of the resulting bitstream. An increased Qp typically results in greater distortion (D) and hence lower video quality. It is thus seen that the Qp directly influences the balance between distortion (D) and rate (R). Therefore, control over the Lagrangian parameter λ (which defines the relative weights of R and D) is synonymous with control over the Qp. The Lagrangian parameter λ is thus built into the rate control process to avoid loss in coding performance while effectively maintaining the required bitrate. The goal of the rate control algorithm is to accurately achieve the target bitrate with the lowest distortion.

Figure 85, below, shows a typical R-D curve of a video where point A is on the R-D curve, whose bitrate is RA. As seen from this figure, when the bitrate increases the distortion decreases and vice versa. In an R-D framework, the goal of rate control is to derive the best operating point around the target bitrate R and the challenge for the rate control algorithm is to estimate the R-D function. There could be several ways to approximate this. The traditional approach is to establish a mathematical relationship between bitrate R and QP and perform the R-D optimization in the QP domain.

Figure 85: Rate distortion curve.

It should be noted that the rate-distortion optimization technique is applied at least across the following three operations in the encoder: 1) motion estimation, in selecting the best possible prediction, 2) mode decision, in selecting the best possible block modes and partitions, and 3) rate control, in allocating the bits within and across the pictures. Of these operations, QP is directly associated only with the rate control operation, wherein it directly controls the bitrate. Non-residual bits, such as those for motion vectors and prediction modes, have no direct relationship with QP, as it is not applied to these parameters. Therefore, using QP alone to model the rate can result in inaccurate cost estimates for these processes.

However, the one parameter that’s universal across all stages where rate distortion optimization is used is λ and a better estimate of bits could be performed by exploiting the robust relationship between λ and R. This, in turn, could provide a more precise bits estimate for every picture.

10.5 Rate Control Concepts

As we know, the goal of rate control is to accurately achieve the target bitrate with the lowest distortion. The rate control algorithm maintains a bit budget. It allocates bits to every picture and every block within each picture by analyzing the video complexity and keeping track of previously allocated bits and the target bitrate. Rate control is not a normative part of the coding standard but is a critical feature distinguishing one encoder from another.

As the picture complexity changes in the video sequence, the encoder keeps track and dynamically changes parameters like QP to accurately follow the requested bitrate while maintaining the picture quality. The heart of the rate control algorithm is a quantitative model that establishes a relationship between the Lagrangian parameter λ, QP, the bitrate and a distortion measure.

Rate control algorithms contain two important functions or sections. The first function is bit allocation. This considers key requirements and allocates bits, starting with GOP-level bits allocation and narrowing down in granularity to a basic unit-level bit allocation. The second function concerns how the target bitrate is achieved for a specific unit. This uses models like the rate distortion optimization models we discussed earlier. In this section, we will delve into these two functions in greater detail. We will explore how they are intertwined and work together in helping to achieve a bitrate that’s close to the target bitrate.

10.5.1 Bit Allocation

This section illustrates the concepts behind how the rate control algorithms allocate bits at various levels of encoding through a simple mechanism. In this scheme, bit allocation is done at the following three levels.

  1. GOP-level bit allocation
  2. Picture-level bit allocation
  3. Basic unit-level bit allocation

It should be noted that the first picture is usually treated in a special way, as the encoder usually has no a priori knowledge of the encoding content and it is impossible for an encoder to accurately estimate the number of bits for the first picture. However, this is usually solved by one or a combination of the following approaches.

  1. Preprocessing. A preprocessing step like lookahead does a first-pass encoding to provide estimates of crude complexity and encoding parameters.
  2. User Input. Use of user-specified encoding parameters, like an initial QP for the first picture, means that the encoder doesn't need to calculate the target bits for this picture. However, this is not a preferred method.
  3. Approximate Bits Estimation. A better approach over a blind initial QP can be to estimate the bits based on a set criterion like the target bitrate. For example, one simple way is to allocate about five to six times the average bits per picture based on the target bitrate, since the first picture is intra coded.
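The crude allocation in the last approach can be sketched as follows (the function name and default multiplier are illustrative assumptions):

```python
def first_intra_budget(target_bps, frame_rate, multiplier=5.5):
    """Crude bit budget for the first (intra) picture: roughly five to six
    times the per-picture average implied by the target bitrate."""
    return multiplier * target_bps / frame_rate
```

At 3 Mbps and 30fps, the per-picture average is 100,000 bits, so the first intra picture would receive roughly 550,000-600,000 bits.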

While different strategies can result in either successful or failed prediction of the number of bits needed in a short, initial time frame, it is important to note that the performance of rate control should be measured over a longer time period. The algorithm has better opportunities to adapt and adjust the bitrate over the longer term to make up for any inaccurate initial bits estimation.

10.5.1.1 GOP-Level Bit Allocation

This is the top-level bit allocation. A target number of bits for the entire GOP is calculated based on the target bitrate and the fullness of the decoder buffer. As the picture complexity can vary dynamically across various scenes, and hence across GOPs, it's difficult to assign a fixed bits target to every GOP. A better approach is to offset the bits target of the current GOP by how many more or fewer bits the previous GOPs used, compared to the target bitrate. If the previous GOPs used more or fewer bits, the current GOP should correspondingly spend fewer or more bits. Enhanced strategies also use a sliding window of about 30-60 pictures within which adjustments are made, to make the bitrate, and hence the video quality adjustment, smoother. The size of the sliding window is bigger than the GOP size, and a larger window size provides more room, leading to smoother adjustments. Once the GOP target bitrate is determined, it is fed down the layers for accurate picture bit allocation and basic unit bit allocation.
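The offsetting idea can be sketched as follows (the function name and the surplus bookkeeping are illustrative assumptions):

```python
def gop_target_bits(target_bps, frame_rate, gop_size, prev_surplus):
    """GOP-level budget: the nominal share of the target bitrate, offset
    by the surplus of earlier GOPs (prev_surplus > 0 means previous GOPs
    overspent, so this GOP is given correspondingly fewer bits)."""
    nominal = target_bps * gop_size / frame_rate
    return nominal - prev_surplus
```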

10.5.1.2 Picture-Level Bit Allocation

Bits that are left available in the GOP are assigned by the picture-level bit allocation algorithm to the pictures, either uniformly or in accord with an assigned picture weight. One method that has been explored is hierarchical bit allocation. In this method, different pictures belong to one of several predetermined levels and every picture is assigned bits in accordance with its level weight. Hierarchical bit allocation has been found to achieve performance improvements because it aligns the bit allocation with other coding parameters of the pictures.

Figure 86: Hierarchical picture level bit allocation scheme.

Figure 86 shows a typical hierarchical bit allocation scheme available in the HEVC software. In this scheme, there are three levels assigned in a hierarchical pattern. Pictures Pic4n and Pic4(n+1) belong to the first level. Level 1 has a picture-level QP value that results in the most bits. Pictures Pic4n+1 and Pic4n+3 belong to the third level. Level 3's QP has been increased the most, resulting in the fewest bits. Picture Pic4n+2 belongs to the second level. It has a mid-level QP adjustment and therefore a bit consumption between the other two levels.

10.5.1.3 Basic Unit-Level Bit Allocation

With this approach, rate control can scale to different levels of granularity within each picture. The level of granularity could be a slice, a block, or any contiguous set of blocks. This granular level is called a basic unit (BU) of rate control, for which distinct values of QP are usually used. The BU-level bit allocation algorithm is quite similar to the picture-level bit allocation algorithm. It allocates the leftover bits to the remaining BUs in accord with the weight of each BU. The weights can be predetermined or may be calculated dynamically. The latter may employ a complexity metric such as the estimated mean absolute difference (MAD) of the prediction error of the collocated BU in the previously coded picture belonging to the same picture level. Ideally, the MAD would be calculated after encoding the current picture. However, that would require us to encode the picture again after the corresponding QP is selected. Instead, we assume that this complexity metric varies gradually across pictures and we use an approximation from the previous pictures belonging to the same level.
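A sketch of weight-proportional BU allocation using MAD estimates as the weights (a hypothetical helper, not from any reference implementation):

```python
def allocate_bu_bits(remaining_bits, mads):
    """Split a picture's remaining bits across its remaining basic units
    in proportion to each BU's complexity weight -- here, the MAD of the
    collocated BU in the previously coded picture at the same level."""
    total = sum(mads)
    return [remaining_bits * m / total for m in mads]
```

A BU whose collocated predecessor was twice as hard to predict receives twice the bits of its simpler neighbors.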

10.5.2 RDO in Rate Control

In the previous section, we explained how bits are allocated at both the picture and the BU level. Using the target rate information and a corresponding complexity (distortion) measure, the encoder has to now compute the λ value used for encoding.

10.5.2.1 Determination of λ

The rate-distortion model, as we explained earlier, does just this by using a mathematical model to derive the λ value from the target bitrate for a picture or BU. It should be noted that the initial values used by the model are not fixed. Different sequences may have quite different modeling values and these values will also get updated as the encoding progresses. Thus, the model adapts dynamically. After encoding one BU or picture, the actual encoded bitrate is used to update its internal parameters and correspondingly derive future updated λ values.
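
One widely used form of such a model is the R-λ model of the HEVC reference software, which relates λ to bits per pixel (bpp) as λ = α·bpp^β and refits its parameters after each encoded unit. The sketch below is a simplified version with a fixed β; the initial values and the smoothing factor are illustrative, not the reference implementation:

```python
class RLambdaModel:
    """Simplified R-lambda model: lambda = alpha * bpp^beta.
    Initial alpha/beta and the update smoothing are illustrative."""

    def __init__(self, alpha=3.2, beta=-1.367):
        self.alpha, self.beta = alpha, beta

    def lam(self, target_bpp):
        # Fewer bits per pixel -> larger lambda (rate weighted more heavily).
        return self.alpha * target_bpp ** self.beta

    def update(self, lam_used, actual_bpp, lr=0.1):
        # After encoding, re-fit alpha so the model would have mapped the
        # observed bpp to the lambda actually used, then smooth the change.
        alpha_fit = lam_used / (actual_bpp ** self.beta)
        self.alpha = (1 - lr) * self.alpha + lr * alpha_fit
```

After each BU or picture, `update` is called with the λ that was used and the bitrate actually produced, so the model adapts dynamically as the source statistics drift.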

10.5.2.2 Encoding with RDO

Coding standards like H.264, H.265 and VP9 provide several partition sizes and inter and intra prediction modes, including direct or skip modes. A critical function in encoding is to ensure that the correct modes are selected. This is crucial in unlocking significant bit rate reductions. However, going through all the different combinations and coming up with optimal decisions comes at the expense of increased computational complexity. One way to select the optimal modes is to use the rate-distortion optimization (RDO) mechanism that was described in an earlier section. This uses the λ value, the target bitrate, and distortion metrics to do the following:

1) perform an exhaustive calculation of all modes to determine the bits used and corresponding distortion of each mode,

2) use its internal model to compute a metric that takes as input the bitrate and distortion calculated for every mode, and

3) select the mode that minimizes the metric.

Thus, once the λ value is determined for a picture or BU, all the coding parameters including partition and prediction modes and motion vectors can be determined by using exhaustive RDO search. The QP value could be also determined by an exhaustive QP optimization to achieve the optimal RD performance or it could be simplified and derived using a model.
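
The three steps listed above amount to minimizing the Lagrangian cost J = D + λ·R over the candidate modes. A minimal sketch, with hypothetical distortion and bit measurements for a single block:

```python
def best_mode(candidates, lam):
    """candidates: list of (mode_name, distortion, bits).
    Returns the candidate minimizing the RD cost J = D + lambda * R."""
    return min(candidates, key=lambda c: c[1] + lam * c[2])

# Hypothetical per-mode measurements for one block:
modes = [("skip", 900.0, 2), ("inter_16x16", 400.0, 40), ("intra_4x4", 250.0, 120)]
choice = best_mode(modes, lam=5.0)  # inter_16x16 at this lambda
```

Note how λ steers the trade-off: a small λ (quality-dominated) would pick the low-distortion intra mode despite its bit cost, while a large λ (rate-dominated) would pick skip.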

It should be noted that the RDO process is complementary to the rate control process as it does not directly control the QP value. The interplay of RDO and rate control is sometimes seen as a chicken and egg problem, because RDO in effect influences the rate control algorithm. As we saw earlier, the distortion measure MAD is needed by the rate control BU-level bit allocation algorithm. However, it’s not accurately available until all the prediction modes and thereby residuals are computed using RDO. This is why the rate control algorithm uses an estimate for MAD from previously encoded pictures. Thus, these two processes are decoupled using approximations as needed to keep the solution computationally feasible. To maintain a consistent video quality and experience, both λ and QP should not change drastically. They are then passed through a QP limiter step that limits their variations both at a picture and a BU level.

10.5.3 Summary of the Rate Control Mechanism

Figure 87 is adapted from the whitepaper titled Rate control and H.264 [1]. It provides an overall summary of the rate control mechanism that has been discussed in previous sections. The inputs to this process are:

A) the target bitrate, usually supplied by the application or through user input,
B) the buffer capacity, and
C) the initial buffer occupancy.


Figure 87: Elements of the rate controller mechanism.

At the start of the encoding process, the target bitrate input is fed to the virtual buffer model (if present), which then provides information on buffer fullness. During encoding, buffer fullness is tracked continuously by monitoring the total bits encoded thus far and the rate at which bits are removed from the buffer. The buffer fullness information, along with the target bitrate, is fed as input to the bit allocation units. These are then used to compute the GOP-level, picture-level and basic unit-level target bits.

The BU-level target bits, along with a surrogate for spatial picture complexity such as MAD (usually stored from previous pictures), are then used as the inputs for the rate and distortion calculations in the R-D model. The R-D model also takes as input an initial QP value, which will have been computed from the target bitrate and updated according to the running bitrate. The output of the R-D model is a corresponding target QP value with which to encode the BU. This target QP value then passes through a QP limiter block that compares it against previous QP values. This ensures a smooth transition and removes any dramatic changes in the QP values. The output of the QP limiter block is the final target QP for the BU.
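
The QP limiter can be sketched as a simple clamp. The per-BU step size and the 0–51 range used here are illustrative (H.264-style), not values mandated by any particular rate control algorithm:

```python
def limit_qp(target_qp, prev_qp, max_step=2, qp_min=0, qp_max=51):
    """Clamp the model's target QP so it moves at most max_step away from
    the previously used QP and stays within the codec's valid range."""
    qp = max(prev_qp - max_step, min(prev_qp + max_step, target_qp))
    return max(qp_min, min(qp_max, qp))
```

For example, if the model suddenly requests QP 40 while the previous BU used QP 30, the limiter only allows a step to 32, smoothing the quality transition across BUs.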

When the BU is quantized and encoded, the following parameters are fed back into the model for future computations.

A) the total number of bits encoded for the BU
B) the residual bits encoded for the BU
C) the actual residual values

The total bits parameter is used to update the buffer fullness in the virtual buffer. The residual bits parameter is updated to provide accurate rate information to the RD model. The actual prediction residuals are fed back into the complexity estimator (MAD).

This framework allocates QPs and corresponding bits to different picture types flexibly using the target picture allocation mechanism at the picture level.

10.6 Adaptive Quantization (AQ)

It’s also possible to get significant QP variations across different BUs in the same picture, in accord with their complexity variations. This tuning is called adaptive quantization (AQ) and is more a visual quality fine tuning tool than a rate control mechanism. Modern encoder implementations have built-in mechanisms to analyze the incoming pictures before they are encoded. These pre-analysis tools, which are also called lookahead, often perform spatial and temporal scene content analysis. By using objective metrics, they provide scene complexity information. AQ algorithms make use of this scene complexity information to optimally allocate the bits across different BUs to provide immense visual quality benefits.

As we know, our eyes are more sensitive to flatter areas in the scene and are less sensitive to areas with fine details and higher textures. AQ algorithms leverage this to increase the quant offset in higher textured areas and decrease it in flatter areas. Thus, more bits are given to areas where the eyes are sensitive to visual quality impacts. This is illustrated in Figure 88 where higher textured areas like the text around the corners are given a positive quant offset and are indicated with a lighter shade. Other flatter areas are given more bits with a negative quant offset and are shown in a darker shade. Thus, AQ serves as a good visual quality fine tuning tool that allows a balanced spatial quality throughout the picture.
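
A common proxy for texture in AQ implementations is block variance. The sketch below is a minimal variance-driven quant offset; the thresholds and offset magnitudes are invented for illustration and would be tuned in a real encoder:

```python
def block_variance(pixels):
    """Sample variance of a block given as a flat list of pixel values."""
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)

def aq_offset(pixels, flat_var=25.0, busy_var=400.0):
    """Positive offset (coarser QP) for highly textured blocks, where
    masking hides the error; negative offset (finer QP, more bits) for
    flat blocks, where the eye is most sensitive."""
    v = block_variance(pixels)
    if v < flat_var:
        return -2
    if v > busy_var:
        return +2
    return 0
```

The final block QP is then the picture (or BU) QP plus this offset, subject to the usual QP range clamping.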


Figure 88: Heat map showing quant offset variation using Adaptive Quantization [2].

10.7 Summary
● Bitrate is a key parameter affecting file size and overall video quality. At a higher bitrate, more bits are available to encode the video. This brings better video quality, at the expense of a larger file size.
● Typical constraints an encoder must respect are latency, bitrate, buffer space, and encoding speed.
● Given a particular setup including bitrate, latency, and so on, the fundamental challenge for any encoder is how to optimize the bitrate and the quality of the output encoded pictures. The encoder must either maximize the output video quality for a given bitrate or minimize the bitrate for a set video quality.
● Video quality is quantified mathematically using distortion measures. These are usually computed at every pixel and averaged over the frame. This provides a good measure of the similarity between the blocks being compared.
● Encoding is an optimization problem in which the distortion between the input video and its output, reconstructed video is minimized, subject to a set of constraints including bitrate and encoding latency.
● Rate control algorithms maintain the bit budget and allocate bits to every picture, and to every block within each picture, by analyzing the video complexity and keeping track of previously allocated bits and the target bitrate.
● Rate control algorithms comprise two important functions: 1) determining and allocating target bits, and 2) achieving the target bitrate.
10.8 Notes
  1. Rate control and H.264. PixelTools Experts in MPEG. http://www.pixeltools.com/rate_control_paper.html. Published 2017. Accessed September 22, 2018.
  2. DOTA2. xiph.org. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed October 30, 2018.

Part III

11  Encoding Modes

In previous chapters we have explored at length how the encoder internally operates to allocate the target bits across the video sequence. Let us now review three important, application-level encoding modes and the mechanisms for bitrate allocation in each of these modes. These encoding modes are agnostic with respect to encoding standards. This means that every encoder can be integrated with one or more of the rate control mechanisms. All these modes are accessible in publicly available x264, x265, and libvpx versions of H.264, H.265 and VP9, respectively.

11.1 VBR Encoding

Variable bit rate (VBR) is a mode wherein the encoder varies the amount of output bits per time segment, usually in seconds. In VBR mode, the encoder allows more bits to be allocated, as necessary, to the more complex segments of the video (like action or high-motion scenes) and uses fewer bits to encode simpler and more static segments. This means the encoder tolerates the dramatic fluctuations in bitrate needed without imposing severe restrictions. By doing optimal bit allocations as needed to encode the scene, the encoder is thus able to maintain the average bitrate and ensure the best video quality at the same time. For example, when we encode a video at 4 Mbps VBR, the encoder will vary the bit rate by giving some sections of frames as much as 6 or 7 Mbps while giving others only 2 or 3 Mbps. Eventually, however, the overall average rate across the whole stream or file would be 4 Mbps.

The advantage of using VBR is that it produces a better-quality video as the encoder operates with less rigid constraints. The disadvantages are that it may consume more bits and result in poorer adherence to the target bitrate especially if the stream has lots of complex scenes. Unrestricted VBR without bitrate caps will result in packet drops when the instantaneous bitrate exceeds the channel bitrate. However, this can be avoided by specifying and imposing an upper limit on the instantaneous bitrate in what is called a capped VBR mode.

11.2 CBR Encoding

In constant bitrate (CBR) encoding, on the other hand, the encoder closely tracks the bits usage. It imposes more rigorous constraints on the bitrate around periodic intervals. Also, it encodes the video at a more or less consistent bitrate by disallowing drastic bitrate swings (peaks or troughs) for the duration of the sequence. Variation exists among different frame types because these will have different data rates (e.g., I frames consume the highest number of bits followed by P and B Frames). However, the allocation of bits across a time segment will be closely monitored and bit expense averaged across the time segment. The encoder has to also consider the buffer model that the video playback devices will employ and closely adhere to this decoder buffer model such that, at any time during the stream, the buffers are neither full nor empty. This is required for a smooth playback.

In this mode, the encoder keeps track of the number of bits consumed and the available bits over a predefined buffer interval. It imposes bit constraints such that the bitrate is maintained over the buffer interval. The variations that happen are much smaller than in VBR. Encoding is constant over an interval, typically around 1 or 2s. The disadvantage with CBR is that, when there is increased activity in a scene that results in a bit rate demand higher than the target rate, the encoder has to prioritize adherence to the bitrate over quality and impose restrictions to keep the bit rate under check. This could potentially result in a lower picture quality relative to video encoded in VBR mode.
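
The buffer tracking described above is essentially a leaky-bucket simulation of the decoder buffer. A hedged sketch at per-frame granularity follows; all sizes are in bits and the numbers in the usage note are illustrative:

```python
def simulate_buffer(frame_bits, bitrate, fps, buf_size, init_fullness):
    """Leaky-bucket check of the decoder buffer: bits arrive at the
    channel rate and each encoded frame is drained on decode. Returns
    the fullness trace and whether the buffer stayed within bounds."""
    fullness = init_fullness
    per_frame_fill = bitrate / fps  # bits delivered per frame interval
    trace, ok = [], True
    for bits in frame_bits:
        fullness += per_frame_fill  # channel fills the buffer
        fullness -= bits            # decoder removes the frame
        trace.append(fullness)
        if fullness < 0 or fullness > buf_size:
            ok = False              # underflow (stall) or overflow
    return trace, ok
```

For instance, at 30000 bits/s and 30 fps, frames of exactly 1000 bits keep the fullness constant, while one oversized frame can drive the buffer into underflow; a CBR rate controller raises QP before that happens.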

For example, when we encode a video at 4 Mbps CBR, the encoder will vary the bit rate by giving some frames as much as 6 or 7 Mbps while giving others only 2 or 3 Mbps, but it ensures that the bitrate does not exceed 4 Mbps in any given time period, say, 1s or 2s. Also, if there’s a frame that requires 5 Mb, the encoder may not permit it, depending on how many bits are available that can be expended for that frame. As you can imagine, if there are many frames that need more than 4 Mbps to encode, the output of CBR will look worse than that of VBR.

11.3 CRF Encoding

The constant rate factor (CRF) mode is a new mode that has garnered a lot of attention in recent years. It is used as the default mode in many modern codecs and is a constant quality encoding mode, in that it prioritizes the quality metric and ensures a constant quality across all sections of the video. The traditional approach to achieving a constant quality was to use a constant QP (fixed QP) encoding mode where a fixed QP is applied to every picture, thereby compressing the pictures equally and resulting in a uniform quality across the sequence. For example, a fixed QP encoding with QP set to 25 will assign every frame the same QP value of 25. However, as this mode ignores any bitrate constraints, it typically results in large swings in the video bitrate across the sequence.

The CRF mechanism improves on this idea to maintain the required level of perceptual quality. One of the ways it does so is by leveraging the HVS and using motion as a metric in varying the QP across different pictures in the video. As we know, the human eye is more sensitive to changes in static uniform scenes and still objects and less sensitive to objects in motion. By using this HVS characteristic, the CRF bitrate algorithm is able to increase the QP and thereby reduce the bitrate accordingly in motion areas while increasing or maintaining the QP to provide more bits in areas of less motion. In areas of high motion, there is a lot going on in the scene for the eyes to perceive, resulting in not enough time to notice the slightly higher compression. However, in static areas, there isn't much happening and any minor change that affects this setting will be quickly perceived by the eye. Thus, in this mode, the encoder adjusts the QP to deliver a fixed perceptual quality output. For example, a CRF encoding QP = 25 will vary the QP, increasing it to, say, 28 for scenes with high motion and lowering it to, say, 23 for scenes with more static content.
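
Following the example above, the motion-driven QP adjustment can be sketched as below. The motion-score thresholds and offsets are invented for illustration; real CRF implementations use considerably more elaborate perceptual models:

```python
def crf_qp(base_qp, motion_score, low=0.2, high=0.7):
    """Illustrative CRF-style per-picture QP adjustment: raise QP where
    motion is high (compression errors are masked by the motion) and
    lower it for static content, where the eye notices every change."""
    if motion_score > high:
        return base_qp + 3   # high motion: coarser quantization
    if motion_score < low:
        return base_qp - 2   # static scene: spend more bits
    return base_qp
```

With a base of 25, this reproduces the behavior described above: roughly 28 for high-motion scenes and 23 for mostly static ones.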

While this approach is counterintuitive, it serves to improve subjective or perceived visual quality significantly. However, it should be noted that this mechanism could also result in a lower video quality, as objectively measured by PSNR. As perceptual quality also depends on uniformity in the pictures, the QP adjustments are always gradual. Also, using an increased QP might still result in a higher bit usage in the high motion areas just because of their sheer complexity.

11.4 When to Use VBR or CBR

Table 14: Comparison of bit allocations in CBR and VBR modes.

Constant Bit Rate (CBR)                               | Variable Bit Rate (VBR)
----------------------------------------------------- | ------------------------------------------------------------
Variable video quality, usually worse than VBR        | Constant, definable video quality and highest video quality
Predictable file sizes                                | Unpredictable file sizes
Compatible with most systems                          | Unpredictable compatibility
Transmission applications with fixed bandwidth pipe   | Storage applications where only final size limit is defined

When it comes to selecting VBR or CBR, it usually depends on the application in question. By default, it is always recommended to use some form of VBR encoding whenever the application permits it. However, in applications where the transmission infrastructure has a fixed bandwidth pipe, CBR is usually the preferred option as it provides much more reliability by ensuring the bitrate does not vary to overflow the available bandwidth pipe. Also, some playback devices only support CBR mode. This is in order to limit the amount of data the decoder must internally buffer during the process of decoding and playback. For maximum device compatibility, CBR could provide a safer option. The common differences between CBR and VBR modes of operation are highlighted in Table 14.

The following example illustrates how the VP9 reference encoder allocates bits in a sample CBR and VBR configuration for a test clip encoded at a target of 3 Mbps. The graph in Figure 89 shows the trendline for the variation of bits over the duration of the clip. As we see from the figure, the bitrate fluctuations are far lower in the CBR mode compared to VBR mode. The test scripts that were used are as follows:

CBR Configuration:

./vpxenc test_1920x1080_25.yuv -o test_1920x1080_25_vp9_cbr.webm --codec=vp9 --i420 -w 1920 -h 1080 -p 1 -t 4 --cpu-used=4 --end-usage=cbr --target-bitrate=3000 --fps=25000/1001 --undershoot-pct=95 --buf-sz=18000 --buf-initial-sz=12000 --buf-optimal-sz=15000 -v --kf-max-dist=999999 --min-q=4 --max-q=56

 

 

VBR 2-Pass Configuration:

./vpxenc test_1920x1080_25.yuv -o test_1920x1080_25_vp9_vbr.webm --codec=vp9 --i420 -w 1920 -h 1080 -p 2 -t 4 --best --target-bitrate=3000 --end-usage=vbr --auto-alt-ref=1 --fps=25000/1001 -v --minsection-pct=5 --maxsection-pct=800 --lag-in-frames=16 --kf-min-dist=0 --kf-max-dist=360 --static-thresh=0 --drop-frame=0 --min-q=0 --max-q=60

 

 


Figure 89: Comparison of bit allocations in CBR and VBR modes.

The following section highlights different typical encoding application scenarios and the choice of the rate control mode for each of these.

11.4.1 Live Video Broadcasting

This is encoding for video distribution over terrestrial, satellite or cable networks, where the requirement is to encode and pack as many channels as possible into a fixed available bandwidth in real time. Video quality is of utmost importance here and is usually achieved at the expense of large latency and buffers, typically a few seconds. Usually, CBR or some form of statistically multiplexed VBR mode is used to encode video in this case. Statistical multiplexing is a VBR technique that leverages the relative complexity of all channels in a bitrate pool at any time and uses time-division multiplexing algorithms to allocate bits to each of these channels in accordance with their complexity. The overall channel-pool bitrate remains CBR but, within the pool, the individual channels themselves are allocated VBR dynamically.

11.4.2 Live Internet Video Streaming

This use case includes live, over the top (OTT) streaming for broadcast content, online gaming and personal live broadcast applications like Facebook live. The workflow is quite similar to live video broadcasting in that real-time video encoding in fixed bandwidth networks is needed. However, the main difference from traditional broadcasting is that the client bitrate is not fixed but varies dynamically based on network conditions. Furthermore, different clients who are served the same video across different locations can have very different network conditions, hence channel bitrates. Thus, creating just one encoded version of the video cannot effectively serve all end users who have different and changing requirements. This problem is solved by using adaptive streaming where multiple versions of the same content are encoded at different bitrates and resolutions and streamed. A bitrate ladder is defined that has the same video encoded using different resolutions and bitrates. Usually, every version is CBR encoded. Different users are thus served one of these versions of the content, based on their network conditions. Moreover, every client’s encoded version of the video can also be dynamically switched to a higher or lower bitrate during playback to suit the changing network conditions.
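
The bitrate ladder and the client-side rendition switch can be sketched as below. The ladder entries and the 80% safety headroom are hypothetical; real services derive both from content analysis and player heuristics:

```python
# Hypothetical bitrate ladder: (width, height, bitrate in kbps), each CBR-encoded.
LADDER = [(640, 360, 800), (1280, 720, 2500), (1920, 1080, 5000)]

def pick_rendition(available_kbps, ladder=LADDER, headroom=0.8):
    """Pick the highest rendition whose bitrate fits within a safety
    fraction of the measured throughput; fall back to the lowest rung."""
    budget = available_kbps * headroom
    fitting = [r for r in ladder if r[2] <= budget]
    return max(fitting, key=lambda r: r[2]) if fitting else ladder[0]
```

During playback the player re-measures throughput and calls this selection again, which is how the dynamic up/down switching described above is realized.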

11.4.3 Video-on-Demand Streaming

Video on demand (VOD) applications are becoming increasingly popular thanks to services like Netflix, Amazon, and Hulu. The encoded content is stored online and accessed by the user in real-time playback. VOD is similar to live internet video streaming in that its video is streamed over the internet and uses adaptive encoding. However, the encoding is done offline, and multi-pass, non-real-time encoding is preferred in order to increase the video quality. Multi-pass encoding with capped VBR is well suited to this application.

11.4.4 Storage

This is used by enterprise and private users to store encoded video in personal drives or cloud storage for archival purposes. The goal is to achieve the best possible quality without too much concern about the file size. Real-time or non-real time CRF encoding would be a good encoding mode to use for this class of applications. If, however, devices like DVDs or Blu Ray Disks are used for storage, there are fixed size restrictions that also have to be considered. The encoding in these cases is done with capped VBR mode, either in real time or non-real time, using some form of multi-pass encoding.

11.5 Summary
● Different application scenarios define how the encoder allocates bits to frames. Three important bitrate modes are: 1) CBR, 2) VBR, and 3) CRF.
● In CBR encoding, the encoder imposes more rigorous constraints on the bitrate around periodic intervals and encodes at a more or less consistent rate by disallowing drastic bitrate swings.
● In VBR mode, the encoder allows more bits, as necessary, for the more complex segments of the video and uses fewer bits to encode simpler and more static segments.
● CRF mode is a newer mode. It is a constant-quality encoding mode that prioritizes the quality metric and ensures a fixed quality across all sections of the video.
● Compared with VBR, CBR is a more predictable mode that is compatible with a broader range of systems.

12  Performance

Video encoding is often an irreversible, lossy process wherein the encoded video is a good approximation of the source and the quality of this approximation depends on various encoding parameters like the quantization parameter (QP) that we discussed earlier in this book. Presumably, the encoded video is degraded relative to the source. The quality of encoding is gauged using a measure of this perceived video degradation compared to the source. The distortion or artifacts produced by the encoding process negatively impact the user experience and this is of paramount importance for content providers and service providers who deploy these systems.

The most important characteristic of any video encoder is quality. Any video encoder goes through evaluations to assess how it performs by using input video sequences that represent a broad variety of content and analyzing the encoded outputs. These clips are typically encoded using standard settings at various target bitrates. There are two broad ways to evaluate the output video quality:

  1. Objective Analysis. This uses mathematical models that approximate a subjective quality assessment. Assessments are automatically calculated using a computer program. The advantage of using this method is that it is easily quantified and always provides a uniform and consistent result for a given set of outputs and inputs. However, its limitations are usually in terms of how accurately the model can approximate human perception. While there are several metrics for objective analysis, three tools that are increasingly used in the industry, namely, PSNR, SSIM and VMAF, are discussed in this chapter.
  2. Subjective Analysis. Here, the set of test video clips is shown to a group of viewers and their feedback, which is usually in some form of a scoring system, is averaged into a mean opinion score. While this method is not easily quantifiable, it’s the most frequently used method. This is because it’s simpler than objective analysis and it connects directly to the real-world experiences of users, who are the ultimate judge of perceived quality. However, the testing procedure may vary depending on what testing setup is available, what encoders are used for the testing, and so on. Subjective analysis is also prone to user bias and opinions.
12.1 Objective Video Quality Metrics

There has been an increasing need to develop objective quality measurement techniques because they provide video developers, standards organizations and other enterprises with the tools to evaluate video quality automatically without the need to view the video. In addition to the ability to benchmark video algorithms, the objective metrics can also be embedded into video coding pipelines as part of the algorithms to optimize and fine tune the quality during the encoding process itself. It should be noted that a majority of the objective video quality (VQ) metrics assume that the undistorted source is available for analysis. Such metrics are called full reference (FR) VQ metrics, the most common of which is the peak signal-to-noise ratio (PSNR) metric. This is widely used as it’s simple to calculate and can be easily integrated within algorithms for optimization. However, PSNR has its limitations. It sometimes does not correlate well with perceived video quality, meaning that a video can look visually good but still have a poor PSNR value and vice versa. Despite such inherent limitations it remains one of the easiest and most widely used metrics in the industry.

In this section, we will review three such objective metrics and consider the advantages and disadvantages of each.

12.1.1 Peak Signal-to-Noise Ratio (PSNR)

PSNR is the ratio between the maximum power of an input signal and the power of compression error (noise). It is expressed in a logarithmic scale. The metric usually provides a good approximation of the perceived quality of the reconstructed output. A higher PSNR value corresponds to higher visual quality. The denominator in the PSNR ratio involves the power of the compression error which is computed using the mean squared error (MSE) between the source (or reference) and the encoded (and subsequently reconstructed) video. PSNR is thus a simple function of this MSE value.

The PSNR is defined as:

PSNR = 10 · log10(MAX² / MSE)

where MSE is defined as:

MSE = (1 / (M·N)) · Σᵢ Σⱼ |Orig(i, j) − Recon(i, j)|²   (for an M×N picture)

The MSE is thus the average of the squared error between the original source pixels and the reconstructed pixels. MAX is the maximum pixel value, which is 255 for a bit depth of 8. The PSNR computation is usually applied frame-by-frame on all the components (especially luma) and the average value for the entire video sequence is used. Typical values for the PSNR in video encoding are between 25 and 50 dB for luma, depending on the video content, bit rate and QP values used for encoding. Based on the HVS, a PSNR of 45 dB and above usually corresponds to an imperceptible visual quality impact in the video.
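As a concrete illustration, the MSE and PSNR definitions above can be computed directly. This is a minimal sketch operating on flat lists of 8-bit luma samples; real implementations work on full frames.

```python
import math

def mse(orig, recon):
    # Mean squared error between source and reconstructed pixels.
    return sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)

def psnr(orig, recon, max_value=255):
    # PSNR = 10 * log10(MAX^2 / MSE); MAX is 255 for 8-bit samples.
    err = mse(orig, recon)
    if err == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(max_value ** 2 / err)

# Example: a tiny "frame" with a one-level error on every pixel.
orig = [100, 120, 140, 160]
recon = [101, 119, 141, 159]
print(psnr(orig, recon))  # MSE = 1, so PSNR = 10*log10(255^2) ≈ 48.13 dB
```

Note how a uniform one-level error already yields about 48 dB, consistent with the observation that 45 dB and above is usually visually transparent.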

12.1.2 Structural Similarity (SSIM)

SSIM is a popular method for quality assessment of still images. It was first proposed for images [1] and has been extended to video. Like PSNR, SSIM is a full reference (FR) metric used for measuring the similarity between two images, and is designed to improve on traditional methods such as PSNR and MSE. The idea behind SSIM is that the HVS is highly specialized in extracting structural information from visual content, and the better the encoder preserves the structural information, the higher the perceived visual quality. Traditional methods like PSNR focus instead on pixel residual errors. The HVS does not extract these from images; hence, they may not directly correlate to perceived quality. A metric based on structural distortion would have a high correlation with visual quality as perceived by the HVS. This would be a better option, in that it effectively blends subjective and objective testing.

Figure 90, from Wang, Lu, & Bovik [1], illustrates the above concepts. The original Goldhill image [1] in Figure 90 (a) has been subjected to the following: (b) distortions including global contrast suppression, (c) drastic JPEG compression, and (d) blurring. In these tests, all the images were set up for a similar MSE relative to the source image. This means that each of these images would yield similar PSNR values.

It is visually obvious from the pictures that, despite the similarity in MSE, the distorted images are very different. A glance is sufficient to perceive that picture (b) is far more visually appealing than the other distorted images. In the JPEG-compressed (c) and blurred (d) images, hardly any of the original image's structures are preserved, so they are no longer visible. On the other hand, the image structures are preserved in the contrast-suppressed image (b). The point here is that error-based metrics like PSNR are prone to fail in scenarios like this and can be misleading.

[Image]

Figure 90: Comparison of images with similar PSNRs but different structural content.

SSIM, on the other hand, does away with error-based computations. Instead, it leverages the characteristic of the HVS to focus on structural information. SSIM defines a model to measure image quality degradation based on changes in its structural information. The idea of structural information is that the strong spatial correlations among pixels in an image or video picture carry important information about the structure of the objects in the picture. This is ignored by error-based metrics like PSNR that treat every pixel independently. If x = {xi | i = 1, 2,…,N} is the original signal and y = {yi | i = 1, 2,…,N} is the reconstructed signal, then the SSIM index is calculated using the following formula [1]:

SSIM(x, y) = ((2·μx·μy + A) · (2·σxy + B)) / ((μx² + μy² + A) · (σx² + σy² + B))

In this equation, μx and μy are the mean of x and y, respectively, and σx², σy², σxy are the variance of x, the variance of y and the covariance of x and y. A and B are constants that are defined based on the bit depth. The value of SSIM ranges from 0 to 1, with 1 being the best value. A value of 1 means that the reconstructed image is identical to the original image. In general, SSIM scores of 0.95 and above are found to have an imperceptible visual quality impact (similar to a PSNR greater than 45 dB). As with PSNR, the SSIM index is computed frame-by-frame on all three components of the video separately, and the overall SSIM index for the video (for every component) is computed as the average of all the frame values.
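The formula above can be sketched as a single-window computation over two pixel lists. Production SSIM is computed over local windows and averaged; the constants A = (0.01·255)² and B = (0.03·255)² used here are the conventional choices for 8-bit content, which is an assumption of this sketch.

```python
def ssim(x, y, max_value=255):
    # Global (single-window) SSIM index between two equal-sized pixel lists.
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((p - mu_x) ** 2 for p in x) / n
    var_y = sum((p - mu_y) ** 2 for p in y) / n
    cov_xy = sum((p - mu_x) * (q - mu_y) for p, q in zip(x, y)) / n
    a = (0.01 * max_value) ** 2  # stabilizing constant A
    b = (0.03 * max_value) ** 2  # stabilizing constant B
    return ((2 * mu_x * mu_y + a) * (2 * cov_xy + b)) / \
           ((mu_x ** 2 + mu_y ** 2 + a) * (var_x + var_y + b))

frame = [100, 120, 140, 160]
print(ssim(frame, frame))                       # identical frames -> 1.0
print(ssim(frame, [90, 110, 130, 150]) < 1.0)   # distorted frame scores lower
```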

12.1.3 Video Multimethod Assessment Fusion (VMAF)

Video content is highly diversified and different kinds of distortions are possible across a variety of content. A video quality metric like PSNR, while suitable for certain source and error characteristics, might not provide an optimal assessment for others. An emerging trend, thus, is to combine the major existing metrics and use them together to derive a fusion method. Novel machine learning tools can be used to assign weights to the different elementary metrics in accordance with the source content and artifact characteristics. Using the existing objective methods, features from the video are extracted and used to feed machine learning algorithms to obtain a trained model that is used to predict the perceived VQ. While such a method does increase complexity, it has been demonstrated to achieve significantly better performance. Furthermore, the performance of the learning tools can also be continuously improved.

Prof. C.-C. Jay Kuo and his colleagues studied ten existing and better-recognized objective quality metrics and 17 image distortion types (such as JPEG compression distortion, quantization noise, and so on) [2]. It was observed that different quality metrics work well with respect to different image distortion types. For example, PSNR may not accurately measure quality for many distortion types, but it works well for additive noise and quantization noise distortions. In general, PSNR-based metrics were found to work well for half of the distortion types, while a feature similarity index metric worked well for the remaining distortions. Thus, they were motivated to develop a unified method that handles all the different distortions by fusing the distortion indices from the preferred elementary metrics into one final score.

Video multimethod assessment fusion (VMAF) is one such fusion metric that has been developed by Netflix in cooperation with Prof. C.-C. Jay Kuo and his team at the University of Southern California. It has garnered the interest of the video community in recent times. VMAF computes an index that quantifies subjective quality by combining three elementary VQ metrics using a machine learning algorithm called a support vector machine (SVM) regressor [3]. The SVM algorithm assigns weights dynamically to each elementary metric to derive the final metric. As it preserves and leverages the strengths of each metric, the SVM metric is considered a better estimate of perceived subjective quality. For example, if the correlation between the subjective mean opinion score (MOS) and an elementary metric is high, the SVM may assign a higher weight and vice versa. The machine-learning model can be trained using subjective experimental data such that the weights produced and the resulting VMAF index across a variety of content accurately reflect perceptual quality.

The Netflix VMAF algorithm [3] incorporated two image quality elementary metrics, namely, visual information fidelity (VIF) and detail loss metric (DLM). It also incorporated motion information by calculating the luma mean absolute pixel differences to account for the temporal characteristics. The above metrics are fused using the SVM regression algorithm to provide a single score for every frame. This is then averaged across all frames to derive the final overall differential MOS (DMOS) value for the entire sequence. It's important to also develop a subjective testing method that yields MOS data that can be used by VMAF to train the internal machine learning model. Also, by using this framework, application-specific customized fusion VMAF metrics can be implemented by experimenting with other elementary metrics, features, and different machine learning algorithms.
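A toy sketch of the fusion idea: per-frame elementary features are combined into one score by a regressor and then pooled over frames. Actual VMAF trains an SVM regressor on subjective MOS data; the fixed linear weights below are hypothetical stand-ins, not real VMAF model parameters.

```python
def fused_frame_score(vif, dlm, motion, weights=(40.0, 50.0, 10.0)):
    # Linear stand-in for the trained regressor: each elementary feature
    # (VIF, DLM, motion) contributes according to its learned weight.
    w_vif, w_dlm, w_motion = weights
    return w_vif * vif + w_dlm * dlm + w_motion * motion

def sequence_score(frames):
    # frames: list of (vif, dlm, motion) feature tuples, one per frame.
    # Per-frame scores are averaged to get the sequence-level score.
    scores = [fused_frame_score(*f) for f in frames]
    return sum(scores) / len(scores)

frames = [(0.95, 0.90, 0.2), (0.92, 0.88, 0.4)]  # made-up feature values
print(sequence_score(frames))
```

Swapping in a trained SVM regressor (and real VIF/DLM/motion extractors) in place of `fused_frame_score` is what turns this skeleton into an actual fusion metric.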

Some studies [4] have shown that a VMAF score of 93 and above results in an imperceptible VQ impact, while at lower scores, a change of about 6 VMAF points results in a noticeable VQ impact. Furthermore, it is also understood that lower-resolution videos score considerably worse on VMAF than higher-resolution videos.

12.2 Encoder Implementations

There are a number of implementations available for the various codecs we have discussed. H.264 has naturally been the most adopted and widely used. In this section, we shall continue to focus on the three important codecs. These offer a rich insight into the coding tools and implementation options available at the present time. Each of these three codecs has at least one freely downloadable source that can be compiled and run using the command line. Alternatively, they can also be used under the widely popular FFmpeg framework if it has been compiled with the corresponding codec support. Using FFmpeg may offer more flexibility and more options under a unified framework, which may be useful to those already familiar with the tool; for exploring the various codec tool options, a standalone compiled codec executable is just fine.

In terms of codec comparisons, various results of comparing newer codecs like VP9 and H.265 to the existing and dominant H.264 are widely published and available on the internet. These have to be carefully analyzed as some of these comparisons may be outdated. Updated results with improved versions of the codecs may not be available. In general, it’s hard to compare different codecs as they usually have different tool sets and not all tools are available across different implementations. This can lead to non-apples-to-apples comparison scenarios. When comparing codecs, the following three aspects are considered for a holistic picture.

 

 

  1. Compression efficiency offered by the encoder across different settings
  2. Encoding speed
  3. Decoding speed

In this section, we shall focus exclusively on encoding compression efficiency tools. These in general are the primary motivation for new codec development and different encoder implementations. Each of the implementations discussed below is an industry de facto reference used to benchmark other implementations.

12.2.1 H.264 Encoders

The widely used implementation for H.264 is the x264 software library developed by VideoLAN. It is available as a free download and is released under the terms of the GNU GPL. Another popular implementation is the OpenH264 software developed by Cisco Systems. This is open source and available under the BSD license. The use of these software implementations for internet video is free, as announced by MPEG LA, the private organization that administers the licenses for patents applying to the H.264 standard.

12.2.2 H.265 Encoders

x265 is the most popular publicly available implementation of the H.265 standard. The x265 project builds on source code from x264, and the source code for this project was made publicly available by a company called MulticoreWare in 2013. x265 is offered under either the GNU GPL or a commercial license, similar to x264. x265 has been integrated into popular applications and frameworks including FFmpeg and HandBrake among others and, like x264, it can be run as a standalone executable or using the FFmpeg APIs. The x265 implementation has been widely compared for compression efficiency to earlier x264 and also VP9 under various comparison studies at different times. It has consistently performed well under different test conditions and different metrics, including SSIM and the new VMAF metric.

12.2.3 VP9 Encoders

VP9 source code (a.k.a. libvpx) is available under Google's WebM project. This can be downloaded, compiled and run in the command line using the executable vpxenc. The documentation for VP9 is provided in the WebM project Wiki, which also provides a few VP9-specific settings for VOD, DASH, constant quality, and constrained quality. The implementation provides encoding using VBR, CBR, constant quality and constrained quality, and both 1-pass and 2-pass encoding modes are available. It should be noted that not all libvpx configuration modes offer real-time encoding. VP9 bitstreams encoded using libvpx are containerized in WebM format. This is a subset of the Matroska container format.

12.3 Video Quality Assessment

As mentioned earlier, there are several published testing results for compression efficiency of various codecs, especially H.264, H.265 and VP9. Some of these are starkly different from the others in terms of their actual results and conclusions. The test process usually involves encoding a select set of video clips of varying content using standardized settings across a range of bit rates and then measuring the objective quality using a metric like PSNR, SSIM or VMAF. This is done for all the test codecs and the results are plotted using bit rate-to-objective quality graphs and compared. As SSIM is often considered a much better balance between subjective and objective testing, let us use it for our tests.

In general, engineers and students seeking to run tests to evaluate encoders can use the following steps:

  1. Carefully select the set of test clips to represent a variety of content (high motion, detailed textures, static, movie, animation, noisy content, and so on) and resolutions (typically UHD, HD, SD and down to a few lower resolutions). For every clip at every resolution, the following steps need to be carried out.
  2. Decide the application configurations and parameter set that need to be tested (e.g., live, offline, GOP settings, and so on).
  3. Select the encoders/codecs to be tested (e.g., VP9 reference, H.264 reference, and so on).
  4. Determine the objective metric that needs to be measured (e.g., PSNR, SSIM, and so on). One or more metrics can be used.
  5. Determine the suitable QP or bit rate range to run the tests.
  6. Write the command line scripts for each of these encoder runs. Although an identical command line configuration might not be possible across different codecs, as they have different tool sets, care should be taken in this step to keep the tests as similar as possible.
  7. Run the tests, preferably using automated scripts.
  8. Get the objective metrics like SSIM and also the corresponding output file sizes. The metric values for just luma can be used or, if preferred, an average of all three components can also be used. Normalize the metrics using the output file sizes to derive the bitrate-adjusted metrics.
  9. Draw the charts and, most importantly, observe the trends and variations to derive useful conclusions.
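The command-line-scripting step above can be automated with a small generator, sketched below. The encoder names, preset and clip name mirror the sample tests in this section; the reduced option set is an assumption for brevity, and paths and options should be adjusted for your setup.

```python
def build_cmd(encoder, bitrate_kbps, clip="720p5994_stockholm_ter.yuv"):
    # Build a near-identical command line for each encoder and bitrate.
    ext = {"x264": "264", "x265": "265"}[encoder]
    return (f"./{encoder} {clip} --output {bitrate_kbps}_{clip[:-4]}.{ext} "
            f"--input-res 1280x720 --frames 200 --ssim --preset veryslow "
            f"--tune ssim --fps 59.94 --bitrate {bitrate_kbps}")

# Print the full test matrix: two encoders x four bitrates.
for enc in ("x264", "x265"):
    for kbps in (1000, 2000, 4000, 8000):
        print(build_cmd(enc, kbps))
```

Wrapping the printed commands in a shell script (or launching them with `subprocess`) gives the automated test runs recommended above.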

In this section we shall illustrate how this is done using one example. The objective and scope of the tests here is not to compare the various codecs but to illustrate the testing procedures that can be used to derive useful conclusions. We run two of the codecs outlined earlier, namely, x264 and x265. We use the settings that correspond to the highest quality (the veryslow preset) at a few select bitrates in the range from 1 Mbps to 8 Mbps for the first 200 frames of the stockholm [5] HD 720p clip. The clip was downloaded in y4m format and converted to yuv format using FFmpeg.

以下是用于所有比特率示例测试的命令行。

The following are the command lines used for the sample tests across all the bitrates.

x264 Configuration

./x264 720p5994_stockholm_ter.yuv --output 1000_720p5994_stockholm_ter.264 --input-res 1280x720 --seek 0 --frames 200 --input-depth 8 --ssim --preset veryslow --no-scenecut --tune ssim --keyint -1 --min-keyint 60 --fps 59.94 --bitrate 1000

x265 Configuration

./x265 720p5994_stockholm_ter.yuv --output 1000_720p5994_stockholm_ter.265 --input-res 1280x720 --seek 0 --frames 200 --input-depth 8 --ssim --preset veryslow --no-scenecut --tune ssim --keyint -1 --min-keyint 60 --fps 59.94 --bitrate 1000

Table 15 consolidates the SSIMs and bitrates for each of the test runs. This is graphically plotted in Figure 91. The figure provides a visual comparison of SSIM vs. bitrates for both x264 and x265 encoders.

Table 15: SSIM for x264 and x265 encoding.

[Image]

 

 

[Image]

Figure 91: Comparison of SSIM vs. bit rate for x264 and x265 encoding.

We see that the x265 curve is to the left of the x264 curve. This means that x265 provides a higher SSIM value, and hence higher perceived visual quality, at the same bitrate as x264. Also, for any given SSIM value, the leftmost (x265) curve has a lower corresponding bitrate than the x264 curve. This can be verified by drawing a horizontal line, similar to the line across the 10 dB SSIM point in the chart. This is a measure of the relative compression efficiency of different encoders.
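The horizontal-line reading of the chart can be automated by interpolating each curve at a target SSIM (here on SSIM's 0-1 scale). The sample points below are illustrative stand-ins, not the measured values from Table 15.

```python
def bitrate_at_ssim(points, target):
    # points: list of (bitrate_kbps, ssim) pairs, sorted by bitrate ascending.
    # Linearly interpolate the bitrate needed to reach the target SSIM.
    for (b0, s0), (b1, s1) in zip(points, points[1:]):
        if s0 <= target <= s1:
            return b0 + (b1 - b0) * (target - s0) / (s1 - s0)
    raise ValueError("target SSIM outside measured range")

# Hypothetical measured curves for the two encoders.
x264_pts = [(1000, 0.90), (2000, 0.94), (4000, 0.97)]
x265_pts = [(1000, 0.93), (2000, 0.96), (4000, 0.98)]

target = 0.94
saving = 1 - bitrate_at_ssim(x265_pts, target) / bitrate_at_ssim(x264_pts, target)
print(f"x265 reaches SSIM {target} with {saving:.0%} fewer bits")
```

The resulting percentage is exactly the bitrate saving read off a horizontal line through the two curves at the chosen quality level.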

12.4 Summary
● There are two approaches to evaluating output video quality: 1) objective analysis and 2) subjective analysis.
● Most objective VQ metrics assume that the undistorted source is available for analysis; such metrics are called full reference (FR) metrics. These include PSNR, SSIM and VMAF.
● PSNR is the most widely used objective metric and is a simple function of the mean squared error (MSE) value between the source and encoded videos.
● The HVS is highly specialized in extracting structural information. Traditional methods like PSNR rely on extracted errors, whereas SSIM focuses on extracting structural information.
● Rather than relying on a single distortion-analysis method, VMAF is based on using the major existing metrics in combination, together with machine learning tools.
● x264 and x265 are widely used downloadable software H.264 and H.265 encoders, respectively. vpxenc from the WebM project is a freely available VP9 encoder.
● Several quality comparisons of x264, x265 and libvpx exist. Both x265 and libvpx use newer tools and perform well, significantly outperforming the previous generation of H.264 encoding.
12.5 Notes
  1. Wang Z, Lu L, Bovik AC. Video quality assessment based on structural distortion measurement. Signal Process Image Commun. 2004;19(1):1-9. https://live.ece.utexas.edu/publications/2004/zwang_vssim_spim_2004.pdf. Accessed September 22, 2018.
  2. Liu T, Lin W, Kuo C. Image quality assessment using multi-method fusion. IEEE Trans Image Process. 2013;22(5):1793-1807. https://www.researchgate.net/publication/234047751_Image_Quality_Assessment_Using_Multi-Method_Fusion. Accessed September 22, 2018.
  3. Li Z, Aaron A, Katsavounidis I, et al. Toward a practical perceptual video quality metric. The Netflix Tech Blog. https://medium.com/netflix-techblog/toward-a-practical-perceptual-video-quality-metric-653f208b9652. Published June 6, 2016. Accessed September 22, 2018.
  4. Rassool R. VMAF reproducibility: validating a perceptual practical video quality metric. Real Networks. https://www.realnetworks.com/sites/default/files/vmaf_reproducibility_ieee.pdf. Accessed October 22, 2018.
  5. stockholm. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed September 22, 2018.

 

 

 

 

13 Advances in Video

According to reports available on the internet [1], video data is poised to consume up to 82% of internet traffic by 2021, with live and VOD video, surveillance video, and VR video content driving much of the video traffic over the web. As more users increase their online video consumption, and with the advent of infrastructural changes like 5G, high-quality video experiences with UHD and higher resolutions and frame rates, delivered at ultralow latencies, will soon be everyday realities. This will be possible with a combination of advances in a few key areas that, cumulatively, are well suited to drive significant growth. In this chapter we will focus on the following three broad areas:

● Advances in machine learning and optimization tools that are integrated into existing video coding frameworks to achieve compression gains with proven, deployed codecs.
● Newer compression codecs with enhanced tools that address upcoming video requirements such as higher resolutions. We will highlight the coding tools in the upcoming next-generation coding standard called AV1.
● Newer experience platforms such as VR and 360 video, whose inherent video requirements are important subjects of upcoming research.
13.1 Per-Title Encoder Optimization

Per-title encoding has been around conceptually and in experimental stages for several years. It was deployed at scale by Netflix in December 2015, as outlined in a Netflix tech blog article [2] that has also inspired this section. Internet streaming video services traditionally use a set of bitrate-resolution pairs, a.k.a. a bitrate ladder. This is a table that specifies, for a given codec, what bit rates are sufficient to use for any fixed resolution. The ladder also, therefore, defines at what bitrates transitions from one resolution to the other occur. For example, if the bitrate ladder defines 1280x720p at 1 Mbps and 720x480p at 500 kbps, then as long as the bitrate remains around 1 Mbps and above, the streaming would use the 720p encoded stream. When the network conditions drop the available bitrate to below 1 Mbps, the streaming would use the 480p version. This implementation is called a fixed ladder, as the resolution used for every bitrate is always fixed. While this is easy to implement, a fixed ladder may not always be the optimal approach. For example, if the content or scene is simple with little texture or motion, it will still be encoded at a fixed bitrate that may be higher than what it really needs. Conversely, highly complex content or scenes may need more bits than what's allocated using even the highest bitrate in the fixed ladder. Also, for a given bitrate a better resolution could be chosen based on the complexity of the content instead of a fixed resolution ladder. For example, complex scenes can be better encoded using 1280x720p at 2 Mbps, while easier content can be encoded using 1920x1080p at the same bitrate. Thus, the fixed approach, while providing good quality, cannot guarantee the best quality for any specific content at the requested bitrate. It is obvious from these examples that the key to successful encoding that's missing in traditional fixed bitrate-resolution ladders is content complexity.
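The fixed-ladder lookup described above can be sketched in a few lines. The 720p/1 Mbps and 480p/500 kbps rungs mirror the example in the text; the exact rungs of any real service's ladder are assumptions here.

```python
# Fixed bitrate ladder: (minimum kbps, resolution), sorted high to low.
LADDER = [(1000, "1280x720"), (500, "720x480")]

def resolution_for(bitrate_kbps, ladder=LADDER):
    # Walk the rungs from highest to lowest and pick the first one whose
    # minimum bitrate the current network conditions can sustain.
    for min_kbps, res in ladder:
        if bitrate_kbps >= min_kbps:
            return res
    return ladder[-1][1]  # below the lowest rung: fall back to it

print(resolution_for(1200))  # 1280x720
print(resolution_for(700))   # 720x480
```

Per-title encoding replaces the single hard-coded `LADDER` with a ladder chosen per title from its content complexity, as described next.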

[Image]

Figure 92: PSNR-bitrate optimal curve for encoding at three resolutions and various bitrates.

The concept can be explained using the following example, where a single source is encoded at three different resolutions starting with lower and moving to higher resolutions across various bitrates. From Figure 92, adapted from the Netflix article, [2] we see that at each resolution, the quality gains start to diminish beyond a certain bitrate threshold. This means that beyond this point, the perceptual quality gains are negligible. This level, as seen in Figure 92, is different for different resolutions. It is clear from this chart that different resolutions are optimal at different bitrate ranges. This optimal bitrate is shown in the dotted curve that is the point of ideal operation. Selection of bitrate-resolution pairs close to this curve yields the best compression efficiency.
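The selection of the optimal operating point can be sketched as picking, for each candidate bitrate, the resolution whose measured quality is highest. The per-resolution (bitrate, quality) samples below are illustrative stand-ins, not measurements from the figure.

```python
# Hypothetical measured quality (e.g., PSNR in dB) per resolution and bitrate.
CURVES = {
    "480p":  {500: 34.0, 1000: 35.0, 2000: 35.3},
    "720p":  {500: 32.0, 1000: 36.0, 2000: 38.0},
    "1080p": {500: 29.0, 1000: 34.0, 2000: 39.0},
}

def best_resolution(bitrate, curves=CURVES):
    # The dotted optimal curve passes through the highest-quality
    # resolution at each bitrate.
    return max(curves, key=lambda res: curves[res][bitrate])

for kbps in (500, 1000, 2000):
    print(kbps, best_resolution(kbps))  # low rates favor low resolutions
```

Note how the winning resolution moves upward as the bitrate grows, which is exactly the crossover behavior of the per-resolution curves.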

Also, the charts are content specific and the optimal bitrates from this chart will not necessarily be optimal for other content. Per-title encoding overcomes this problem posed by fixed ladders by choosing content-specific bitrate-resolution pairs close to this curve for every title. To do this, experimental results from several data sets can be used to classify source material based on different complexity types and different bitrate ladders can be chosen per-title based on their content classification.

13.2 Machine Learning

Machine learning (ML) is the new-age technology that increasingly has an impact on everything, including video coding. ML tools have been around for several decades, since the time when researchers began studying ways to analyze and learn from data, using results of the analysis to build models and make predictions. Simply put, ML algorithms take a set of sample data as input, analyze the data samples, learn from them and make predictions based on what is learned to model systems. As more data gets input and processed, the output results are provided as feedback to the algorithms. This continuously trains the algorithms and improves the accuracy of the prediction models. This learning becomes a key aspect of the overall ML framework. In today’s data age, the ML approach thus provides a mechanism to model any complex scenario solely by processing large volumes of data and learning from the observed outputs. Let us see in this section how and where ML tools can be used in video coding.

13.2.1 ML Tools for Video Coding Optimization

An ML system can be built and trained using a set of data. This system can then model and provide predicted outputs for future inputs. While a variety of predicted outputs are possible for any given set of inputs, these algorithms use mathematical optimization techniques that minimize a cost function to get the best output. Thus, ML tools can be deployed to provide solutions for any kind of optimization problem and this includes video coding. We established in Chapter 10 how the fundamental encoding problem is an optimization problem, one of deriving an optimized video quality for a given bit rate and vice versa.

Video optimizations using existing codecs like H.264 and VP9 are becoming more important as more and more streaming service providers are using optimization as a key tool to differentiate themselves from competitors and engage their users by improving the user experience. Video coding optimization-driven bit rate savings also mean little change to existing infrastructure and workflows. These are huge benefits to any video service provider.

Early in 2018, Netflix announced their dynamic optimizer implementation [3]. Their algorithm used machine learning techniques to analyze each frame and applied compression based on the content. The VMAF quality metric was used as the optimization objective, and every shot was encoded at different bitrates to meet the target VMAF score. In this approach, a variety of shots were shown to users, who rated them for complexity and content. The resulting subjective (MOS) scores from these tests were then used to train ML algorithms to model picture quality. This was in turn used to optimize encoding of the video sequence on a shot-by-shot basis. It was observed that with this approach the overall bit rate was considerably reduced while still maintaining a uniform quality level, because every shot was optimally encoded. Using this optimization technique, Netflix was able to demonstrate streams that looked identical at half the bandwidth, especially at low bitrates.
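The shot-by-shot idea can be sketched as a search over a bitrate ladder for the cheapest encode that meets the quality target. This is a hedged illustration, not Netflix’s actual implementation: `encode_shot` and `measure_vmaf` are hypothetical stand-ins for a real encoder invocation and a real VMAF measurement.

```python
import math

# Hedged sketch of shot-based bitrate selection (not Netflix's actual code).
def choose_bitrate(shot, target_vmaf, ladder, encode_shot, measure_vmaf):
    """Return the lowest ladder bitrate whose encode meets the VMAF target."""
    for bitrate in sorted(ladder):
        encoded = encode_shot(shot, bitrate)
        if measure_vmaf(shot, encoded) >= target_vmaf:
            return bitrate
    return max(ladder)  # target unreachable: fall back to the highest rung

# Toy stand-ins: quality rises with the log of bitrate, scaled down for
# more complex shots. The constants are invented for illustration.
def fake_encode(shot, bitrate):
    return (shot, bitrate)

def fake_vmaf(shot, encoded):
    return min(100.0, 30.0 * math.log10(encoded[1] / shot["complexity"]))

easy = {"complexity": 1.0}
hard = {"complexity": 8.0}
ladder = [500, 1000, 2000, 4000, 8000]  # kbps
print(choose_bitrate(easy, 93, ladder, fake_encode, fake_vmaf))  # 2000
print(choose_bitrate(hard, 93, ladder, fake_encode, fake_vmaf))  # 8000
```

The easy shot hits the target at a low rung while the complex shot is pushed to the top of the ladder, which is exactly why per-shot selection saves bits overall compared to one fixed ladder for the whole title.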

Other ideas to reduce encoding complexity using ML tools have also been explored by G. F. Escribano and his colleagues [4]. They employed low-complexity input video attributes to train an ML algorithm that can then classify video coding decisions and obtain a video coding decision tree using features derived from input video.

Figure 93: Applying machine learning to build mode decision trees.

Common attributes used in video coding algorithms include statistical pixel means, variance values, SAD values and the like. The computationally simple coding decision trees can then effectively supplement or, in some cases, even replace computationally expensive cost-based decisions. The fundamental challenges here would be around the selection of the low-complexity input video attributes and also the selection of the classification ML algorithms. The process is illustrated in Figure 93. Needless to say, the accuracy of the ML algorithm outputs would be highly dependent on the selection of the input data set that is used for training purposes.
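To make the idea in Figure 93 concrete, here is a minimal sketch in which cheap block statistics feed a tiny decision tree that predicts a split/no-split decision. The thresholds are invented for illustration; in the approach described above they would be learned from training data rather than hand-written.

```python
# Illustrative sketch of a low-complexity mode-decision tree.
# Thresholds are invented, not learned from real data.

def block_features(block, predicted):
    """Low-complexity attributes: mean, variance, and SAD vs. the predictor."""
    n = len(block)
    mean = sum(block) / n
    variance = sum((p - mean) ** 2 for p in block) / n
    sad = sum(abs(a - b) for a, b in zip(block, predicted))
    return mean, variance, sad

def predict_split(variance, sad):
    """A hand-written two-level decision tree for a split/no-split decision."""
    if sad < 64:            # predictor already matches well
        return "no_split"
    if variance > 500:      # busy block: finer partitions likely pay off
        return "split"
    return "no_split"

flat = [128] * 16          # uniform 4x4 block (as a flat list of samples)
busy = [0, 255] * 8        # high-variance checkerboard-like block
pred = [120] * 16          # a hypothetical predictor block
for blk in (flat, busy):
    mean, var, sad = block_features(blk, pred)
    print(predict_split(var, sad))  # no_split, then split
```

Evaluating two comparisons per block is far cheaper than exhaustively encoding both partition choices and comparing rate-distortion costs, which is the point of the classification approach.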

13.3 The Emerging AV1 Codec

AV1 is an emerging, open and royalty-free standard developed by the Alliance for Open Media (AOM), a consortium of several companies in the video industry. The AV1 format is based on Google’s VP10 project and is expected to produce anywhere from 25-40% efficiency improvement over existing codecs like VP9 and H.265. It is specifically designed for web streaming applications with higher resolutions and frame rates and HDR functions. The reference encoder is available for free download online, and work on improving its computational efficiency to make encoding run faster is ongoing.

AV1 uses the same traditional, block-based, hybrid model with several enhancements that, in aggregate, account for its increased compression efficiency. In this section, we will explore the enhancements and features offered by AV1. Table 16 best illustrates these enhancements and how they compare to industry-leading HEVC and VP9 standards.

Table 16: Comparison of AV1 tools and enhancements against HEVC and VP9.

Coding Tool: Block size
  VP9:  64x64 superblocks
  HEVC: Variable 8x8 to 64x64 CTUs
  AV1:  128x128 superblocks

Coding Tool: Partitioning
  VP9:  Variable partitions from 64x64 hierarchically down to 4x4.
  HEVC: Variable partitions from 64x64 hierarchically down to 4x4.
  AV1:  Additional T-shaped partitioning schemes are supported. Wedge-shaped partitioning for prediction is also introduced, to enable more accurate separation of objects along non-square lines.

Coding Tool: Transforms
  VP9:  Variable 32x32 down to 4x4 integer DCT transforms + 4x4 integer DST transform.
  HEVC: Variable 32x32 down to 4x4 integer DCT transforms + 4x4 integer DST transform.
  AV1:  Variable 32x32 down to 4x4 transforms, with rectangular versions of the DCT and asymmetric versions of the DST. Two 1-D transforms can be combined to use different horizontal and vertical transforms.

Coding Tool: Intra prediction
  VP9:  10 modes, including 8 angles for directional prediction.
  HEVC: 35 prediction modes.
  AV1:  65 prediction angles, and includes weighted prediction.

Coding Tool: Inter prediction
  VP9:  No weighted prediction support.
  HEVC: Supports temporal weighted prediction.
  AV1:  Supports spatial overlapped block motion compensation, global MVs, and warped MVs.

Coding Tool: Sub-pixel interpolation
  VP9:  Support for higher-precision ⅛ pel MVs.
  HEVC: ¼ pixel eight-tap filter for Y and ⅛ pixel four-tap filter for UV.
  AV1:  Support for higher-precision ⅛ pel MVs.

Coding Tool: Internal precision for prediction
  VP9:  Configurable 8-bit and 10-bit are supported.
  HEVC: Configurable 8-bit and 10-bit are supported.
  AV1:  Internal processing in higher precision, configurable as 10 or 12 bits.

Coding Tool: Filtering
  VP9:  In-loop deblocking filter affecting up to 7 pixels on either side of the edges. No SAO filter.
  HEVC: In-loop deblocking filter affecting up to 3 pixels on either side of the edges. SAO filter with edge and band offset modes.
  AV1:  Loop restoration filter similar to the deblocking filter. Also includes a new directional filter to remove ringing noise.

Coding Tool: Entropy coding
  VP9:  Binary arithmetic coding with frame-level adaptation.
  HEVC: Binary arithmetic coding with row-level adaptation.
  AV1:  Non-binary arithmetic coding with symbol-level probability adaptation.

Coding Tool: Block skip modes
  VP9:  No support for skip modes.
  HEVC: Merge modes.
  AV1:  No explicit skip modes.

Coding Tool: Motion vector prediction
  VP9:  4 modes, including three modes with implicit MV prediction from spatial and temporal neighbors.
  HEVC: Enhanced spatial and temporal prediction.
  AV1:  Dynamic reference MV selection [5].

Coding Tool: Parallelism tools
  VP9:  No WPP, but column and row tiles are supported, with no prediction across column tiles and prediction possible across tile rows.
  HEVC: Wavefront parallel processing, tiles, slices.
  AV1:  No WPP, but column and row tiles are supported, with configurable prediction across tile rows.

Coding Tool: Reference pictures
  VP9:  Up to 3 frames from 8 available buffers.
  HEVC: Up to 16 frames depending on resolution.
  AV1:  Up to 6 frames from 8 available buffers.

Coding Tool: Interlaced coding
  VP9:  Only frame coding is supported.
  HEVC: Only frame coding is supported.
  AV1:  Only frame coding is supported.

13.4 Virtual Reality and 360° Video

Emerging technologies like 360° video (immersive or spherical video) and VR aim to provide a digital experience in a fully immersive environment with full fidelity to human perception. In a regular video or movie experience, cameras capturing roughly 8 million pixels per frame (UHD) for each component at 30fps are perhaps perfectly adequate for a good experience. However, for a completely immersive experience, the captures have to match the rates of capture by human eyes, which receive light constantly and perceive much faster motion, say, between 90-150fps. Also, because the field of view at such close proximity is wider, a far greater number of pixels needs to be processed in every frame. By one estimate, the human eye can receive 720 million pixels in each of two eyes. [6] At 36 bits per pixel and 60fps, this adds up to about 3.1 terabits per second. Even when compressed 600 times by existing compression schemes like H.265, this would require 5.2 Gbps. While the calculations may or may not be completely accurate, what is important here is to understand that fully immersive experiences are a completely different ball game compared to traditional video experiences. Furthermore, unlike in traditional video where the image is projected on a flat plane, a 360° video is projected onto a spherical surface around the viewer. This presents challenges when such projected images are compressed using traditional coding tools that are built around conventional video.
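The back-of-the-envelope estimate above can be reproduced directly; the numbers are the ones quoted in the text:

```python
# Reproducing the bandwidth estimate from the text.
pixels_per_eye = 720e6   # estimated pixels receivable by one eye
eyes = 2
bits_per_pixel = 36
fps = 60

raw_bps = pixels_per_eye * eyes * bits_per_pixel * fps
print(raw_bps / 1e12)            # ~3.1 terabits per second, uncompressed

compressed_bps = raw_bps / 600   # an H.265-class compression ratio
print(compressed_bps / 1e9)      # ~5.2 Gbps
```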

13.4.1 What is 360° Video?

In 360° videos, every direction or angle that’s needed to complete the full 360° is simultaneously captured, either using a single omnidirectional camera or a special rig of several cameras. The angles that need to be captured range horizontally from -180° to +180° and vertically from -90° to +90°. The captures from all these angles are then stitched to produce one final video that gives a complete spherical view of the scene. This can be done either directly within the camera itself or externally using merging software. The video is then projected either monoscopically or stereoscopically; the latter means different images, one for each eye. The result can be viewed using a flat panel or a head-mounted display (HMD). Only a small section of the projection, called the viewport, is viewed at any time, and the section being presented varies dynamically based on the head or device motion. This workflow is illustrated in Figure 94, where current-generation codecs like HEVC are used to encode the 2D video that is obtained after 2D projection of the input spherical video.

The projection format information for standard projection formats is also available to be included in HEVC SEI messages. When the streaming video is received, it is decoded using standard HEVC decoders that also propagate the projection information. The projection information, along with the viewport information, is used to extract the specific section and display it spherically on the HMD.

The selection of the projection technique is extremely important as it determines how we finally perceive the three-dimensional space (spherical) view using a two-dimensional surface (plane) and there are different ways in which this can be done. Popular projection schemes include equirectangular, cube map and pyramid projections. Amongst these, cube map and pyramid projection schemes have been found to provide significant data reduction compared to equirectangular projections.

Equirectangular is the simplest projection. Every longitudinal meridian is mapped to a vertical straight line with fixed spacing, and every latitudinal circle is mapped to a horizontal straight line with fixed spacing. This introduces significant distortions, especially at the poles. Furthermore, it also creates warped images with bent motion, which is undesirable for block-based compression schemes that assume straight-line motion vectors. Also, the top and bottom poles of the equirectangular projection contain redundant data that are expensive to process.
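The fixed-spacing mapping just described can be written down in a few lines. This is a minimal sketch of the standard equirectangular convention; the 3840x1920 frame size is just an example:

```python
# Minimal sketch of the equirectangular mapping: longitude and latitude
# map linearly to pixel columns and rows (fixed spacing), which is what
# stretches content near the poles.
def equirect_to_pixel(lon_deg, lat_deg, width, height):
    """Map (longitude, latitude) in degrees to (x, y) pixel coordinates.

    lon_deg in [-180, 180) maps linearly to x in [0, width);
    lat_deg in [-90, 90] maps linearly to y in [0, height), +90 at the top.
    """
    x = (lon_deg + 180.0) / 360.0 * width
    y = (90.0 - lat_deg) / 180.0 * height
    return x, y

w, h = 3840, 1920
print(equirect_to_pixel(0, 0, w, h))      # image center: (1920.0, 960.0)
print(equirect_to_pixel(-180, 90, w, h))  # top-left corner: (0.0, 0.0)
```

Note that a full latitude circle near a pole is tiny on the sphere yet still occupies an entire pixel row, which is the redundancy referred to above.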

Figure 94: Current 360° video delivery workflow with 2D video encoding.

360° Image source: https://pixabay.com/en/winter-panorama-mountains-snow-2383930/

The problems with equirectangular projections can be addressed using cube maps. These have six unique faces in which pixels are uniformly distributed, without any bending distortions within the faces themselves. This also means that there are no poles containing redundant information. While cube maps solve the problems of equirectangular projections, they use an equal distribution across all angles (represented by the faces of the cube) and still consume a lot of bandwidth, especially for a 360° VR experience where headsets can support resolutions beyond 4K. Encoding and streaming this whole set of data thus produces two problems.

1) The bandwidth required is very large, often several hundred Mbps.
2) Not all decoders can handle resolutions beyond UHD, especially the mobile devices used in mobile VR headsets.
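To make the cube-map idea concrete, here is a hedged sketch that assigns a 3D view direction to one of the six faces (the face whose axis has the largest magnitude). This is one common convention, not a normative layout from any codec:

```python
# Hedged sketch of cube mapping: pick which of the six faces a 3D view
# direction lands on. The dominant-axis rule below is one of several
# conventions for laying out a cube map.
def cube_face(x, y, z):
    """Return the face name for direction (x, y, z); need not be unit length."""
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax >= ay and ax >= az:
        return "+x" if x > 0 else "-x"
    if ay >= ax and ay >= az:
        return "+y" if y > 0 else "-y"
    return "+z" if z > 0 else "-z"

print(cube_face(1.0, 0.2, -0.3))   # +x
print(cube_face(0.1, -2.0, 0.5))   # -y
print(cube_face(0.0, 0.0, -1.0))   # -z
```

Because every direction lands on exactly one face and pixels are spread uniformly within each face, there are no oversampled poles — but all six faces are stored at full resolution regardless of where the viewer is looking, which is the bandwidth problem described above.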

While it’s true that devices offer only a limited field of view (FOV) of ~95° and much of the 360° is redundant, in VR the user controls the immersive experience and every slight change in the head and eye movement results in a different view that needs to be presented with utmost fidelity. Thus, the challenge here is to somehow transmit the complete FOV yet optimize the significant redundancies that exist in views that are outside the immediate FOV.

Thus, 360° video poses significant challenges in processing the huge amounts of data. As compression schemes still rely on traditional block-based approaches that are built around conventional video, there are significant research opportunities to model the complete 360° field of view and associated motion to develop newer prediction modes that are suited for these upcoming applications and ecosystems.

The Joint Video Experts Team (JVET) is a joint effort between ISO-IEC and ITU-T. The team is now working on the successor to the HEVC codec, called Versatile Video Coding or VVC. It targets the efficient encoding of 360° video content with known or new projection formats. This video standard is targeted to be finalized by 2020 and also aims to achieve 50% bit rate efficiency over HEVC for traditional 2D content. Some 360° video-specific tools that have been proposed as part of this standard include prediction using reference samples based on spherical geometry instead of the regular 2D rectangular geometry, adaptive quantization based on spherical pixel density, and others. The compression benefits would result in an immediate reduction in the storage costs and bandwidth needed for at-scale deployments.

13.5 Summary

● Per-title encoding overcomes the limitations of a fixed-resolution encoding ladder by using content complexity as the measure that determines the bitrates and resolutions at which a particular piece of content will be encoded.
● At the core of ML prediction algorithms are mathematical optimization techniques that minimize an internal cost function to obtain the best predicted output. This can be explored to optimize video encoding.
● AV1 is a new, open, royalty-free standard developed by the Alliance for Open Media (AOM). It uses the block-based hybrid model, building on VP9 with several enhancements that, in aggregate, account for its increased compression efficiency.
● Emerging technologies such as 360° video and virtual reality involve more complex visual scenes and require video throughput that is orders of magnitude higher, with low latency. Together these impose significant compression requirements, and there are therefore opportunities to develop compression systems suited to these applications.
13.6 Notes
  1. Cisco Visual Networking Index: Forecast and Methodology, 2016–2021. Cisco. June 6, 2017. https://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/complete-white-paper-c11-481360.html. Updated September 15, 2017. Accessed September 22, 2018.
  2. Aaron A, Li Z, Manohara M, et al. Per-title encode optimization. The Netflix Tech Blog. https://medium.com/netflix-techblog/per-title-encode-optimization-7e99442b62a2. Published December 14, 2015. Accessed September 22, 2018.
  3. Jones A. Netflix introduces dynamic optimizer. Bizety. https://www.bizety.com/2018/03/15/netflix-introduces-dynamic-optimizer/. Published March 15, 2018. Accessed September 22, 2018.
  4. Escribano GF, Jillani RM, Holder C, Cuenca P. Video encoding and transcoding using machine learning. MDM'08 Proceedings of the 9th International Workshop Multimedia Data Mining: Held in Conjunction with the ACM SIGKDD 2008. New York, NY: Conference KDD'08 ACM; 2008:53-62. https://www.researchgate.net/publication/234810713/download. Accessed September 22, 2018.
  5. Massimino P. AOM - AV1 How does it work? AOM-AV1 Video Tech Meetup. https://parisvideotech.com/wp-content/uploads/2017/07/AOM-AV1-Video-Tech-meet-up.pdf. Published July, 2017. Accessed September 22, 2018.
  6. Begole, B. Why the internet pipes will burst when virtual reality takes off. Forbes Valley Voices. Forbes Media LLC. https://www.forbes.com/sites/valleyvoices/2016/02/09/why-the-internet-pipes-will-burst-if-virtual-reality-takes-off/#34c7f6e43858. Published February 9, 2016. Accessed September 22, 2018.

Resources

Bankoski J, Wilkins P, Xu X. Technical overview of VP8, an open source video codec for the web. Google, Inc. http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/37073.pdf. Accessed September 23, 2018.

Bultje RS. Overview of the VP9 video codec. Random Thoughts Blog, General [Internet]. https://blogs.gnome.org/rbultje/2016/12/13/overview-of-the-vp9-video-codec/. Published December 13, 2016. Accessed September 23, 2018.

Exponential-Golomb Coding. Wikipedia. https://wikivisually.com/wiki/Exponential-Golomb_coding. Updated July 9, 2018. Accessed September 23, 2018.

Ghanbari M. History of video coding. Chapter 1 in Standard Codecs: Image Compression to Advanced Video Coding. London, England: Institution of Electrical Engineers; 2003. https://flylib.com/books/en/2.537.1/history_of_video_coding.html. Accessed September 23, 2018.

Grange A, Alvestrand HT. A VP9 bitstream overview. Network Working Group Internet-Draft. ITEF.org. https://tools.ietf.org/id/draft-grange-vp9-bitstream-00.html#rfc.section.2.6. Published February 18, 2013. Accessed September 23, 2018.

Grange A, de Rivaz P, Hunt J. VP9 bitstream and decoding process specification - v0.6. Google, Inc. webmproject.org. https://storage.googleapis.com/downloads.webmproject.org/docs/vp9/vp9-bitstream-specification-v0.6-20160331-draft.pdf. Published March 31, 2016. Accessed September 23, 2018.

Karwowski D, Grajek T, Klimaszewski K, et al. 20 years of progress in video compression – from MPEG-1 to MPEG-H HEVC. General view on the path of video coding development. In Choras RS, ed. Image Processing and Communications Challenges 8: 8th International Conference, IP&C 2016 Bydgoszcz, Poland, September 2016 Proceedings. Cham, Switzerland: Springer International Publishing; 2017:3-15. https://www.researchgate.net/publication/310494503_20_Years_of_Progress_in_Video_Compression_-_from_MPEG-1_to_MPEG-H_HEVC_General_View_on_the_Path_of_Video_Coding_Development. Accessed September 23, 2018.

Marpe D, Schwarz H, Wiegand T. Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard. IEEE Circuits Syst Video Tech. http://iphome.hhi.de/wiegand/assets/pdfs/csvt_cabac_0305.pdf. Accessed September 23, 2018.

Melanson M. Video coding concepts: Quantization. Breaking Eggs and Making Omelettes: Topics on Multimedia Technology and Reverse Engineering; Multimedia Mike [Internet]. https://multimedia.cx/eggs/video-coding-concepts-quantization/. Published April 5, 2005. Accessed September 23, 2018.

Mukherjee D, Bankoski J, Bultje RS, et al. A technical overview of VP9—the latest open-source video codec. SMPTE Motion Imaging J. 2015;124(1):44-54. https://www.researchgate.net/publication/272399193/download. Accessed September 22, 2018.

Mukherjee D, Bankoski J, Bultje RS, et al. A technical overview of VP9: The latest royalty-free video codec from Google. Google, Inc. http://files.meetup.com/9842252/Overview-VP9.pdf. Accessed September 23, 2018.

Ozer J. Finding the just noticeable difference with Netflix VMAF. Streaming Learning Center. streaminglearningcenter.com. https://streaminglearningcenter.com/learning/mapping-ssim-vmaf-scores-subjective-ratings.html. Published September 4, 2017. Accessed September 23, 2018.

Ozer J. Mapping SSIM and VMAF scores to subjective ratings. Streaming Learning Center. streaminglearningcenter.com. https://streaminglearningcenter.com/learning/mapping-ssim-vmaf-scores-subjective-ratings.html. Published July 5, 2018. Accessed September 23, 2018.

Ozer J. Video Encoding By The Numbers: Eliminate the Guesswork from Your Streaming Video. Galax, VA: Doceo Publishing, Inc.; 2017.

pieter3d. How VP9 works, technical details & diagrams. Doom9’s Forum [Internet]. https://forum.doom9.org/showthread.php?t=168947. Published October 8, 2013. Accessed September 23, 2018.

Rate distortion optimization for encoder control. Fraunhofer Institute for Telecommunications, Heinrich Hertz Institute, HHI. https://www.hhi.fraunhofer.de/en/departments/vca/research-groups/image-video-coding/research-topics/rate-distortion-optimization-rdo-for-encoder-control.html. Accessed September 23, 2018.

Riabtsev S. Video Compression. www.ramugedia.com. http://www.ramugedia.com/video-compression. Accessed September 23, 2018.

Richardson I. A short history of video coding. Invited talk at United States Patent and Trade Office, PETTP 2014 USPTO Tech Week, December 1-5, 2014. SlideShare Technology. slideshare.net. https://www.slideshare.net/vcodex/a-short-history-of-video-coding. Accessed September 23, 2018.

Richardson I, Bhat A. Historical timeline of video coding standards and formats. Vcodex. https://goo.gl/bqyyXd. Accessed September 23, 2018.

Richardson IE. H.264 and MPEG-4 Video Compression: Video Coding for Next-generation Multimedia. West Sussex, UK: John Wiley & Sons, Ltd.; 2003.

Robitza W. Understanding rate control modes (x264, x265, vpx). SLHCK.info. https://slhck.info/video/2017/03/01/rate-control.html. Published March 1, 2017. Updated August, 2018. Accessed September 23, 2018.

Sonnati F. Artificial Intelligence in video encoding optimization. Video Encoding & Streaming Technologies, Fabio Sonnati on Video Delivery and Encoding Blog [Internet]. https://sonnati.wordpress.com/2017/10/09/artificial-intelligence-in-video-encoding-optimization/. Published October 9, 2017. Accessed September 23, 2018.

Sullivan GJ, Ohm J, Han W, Wiegand T. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Trans Circuits Syst Video Technol. 2012;22(12):1649-1668. https://ieeexplore.ieee.org/document/6316136/?part=1. Accessed September 21, 2018.

Urban J. Understanding video compression artifacts. biamp.com. Component. Biamp's Blog [Internet]. http://blog.biamp.com/understanding-video-compression-artifacts/. Published February 16, 2017. Accessed September 23, 2018.

Vinayagam M. Next generation broadcasting technology - video codec. SlideShare Technology. slideshare.net. https://www.slideshare.net/VinayagamMariappan1/video-codecs-62801463. Published June 7, 2016. Accessed September 23, 2018.

WebM Blog. webmproject.org. http://blog.webmproject.org. Accessed September 23, 2018.

Wiegand T, Sullivan GJ, Bjøntegaard G, Luthra A. Overview of the H.264/AVC Video Coding Standard. IEEE Trans Circuits Syst Video Technol. 2003;13(7):560-576. http://ip.hhi.de/imagecom_G1/assets/pdfs/csvt_overview_0305.pdf. Accessed September 21, 2018.

Wien M. High Efficiency Video Coding: Coding Tools and Specification. Berlin and Heidelberg, Germany: Springer-Verlag; 2015.

xiph.org. [YUV video sources]. Xiph.org Video Test Media [derf's collection]. https://media.xiph.org/video/derf/. Accessed September 23, 2018.

Ye Y. Recent trends and challenges in 360-degree video compression. Keynote presentation at IEEE International Conference on Multimedia and Expo (ICME), 9th Hot 3D Workshop. InterDigital Inc. SlideShare Technology. slideshare.net. https://www.slideshare.net/YanYe5/recent-trends-and-challenges-in-360degree-video-compression. Published August 1, 2018. Accessed September 23, 2018.

Zhang H, Au OC, Shi Y, 等。改进了 HEVC 的样本自适应偏移2013年亚太信号与信息处理协会年会暨会议论文集IEEE。http://www.apsipa.org/proceedings_2013/papers/142_PID2936291.pdf 2013 年出版。2018 年 9 月 23 日访问。

Zhang H, Au OC, Shi Y, et al. Improved sample adaptive offset for HEVC. Proceedings of the 2013 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE. http://www.apsipa.org/proceedings_2013/papers/142_PID2936291.pdf. Published 2013. Accessed September 23, 2018.

INDEX

⅛ pixel precision

2

2-D logarithmic search

3

360 Video

360° Video

A

adaptive quantization

Advanced Motion Vector Prediction

Alliance for Open Media

arithmetic coding

AV1

B

backward adaptation

Band Offset

Bidirectional Prediction

binarization

Bit Allocation

buffer capacity

buffer fullness

C

capped VBR

CBR

Codecs

color model

color space

Compression efficiency

context adaptive variable length coding

context modeling

CRF

D

DCT

deblocking

decoders

diamond search

differential coding

discrete cosine transform

discrete sine transform

distortion measure

Dynamic range

E

Edge Offset

Energy Compaction

entropy

Entropy coding

equirectangular projection

Error resilience

exhaustive search

Exp-Golomb

F

FFmpeg

field of view

filter strengths

Fixed Length

forward prediction

full reference

Full search

H

H.264

H.265

Hadamard transform

HDR

head-mounted display

Huffman coding

HVS

I

in-loop filtering

information theory

Inter frames

Inter prediction

Interlaced

interpolation

intra frame coding

Intra prediction

J

Joint Video Experts Team

Joint Video Team

K

Karhunen-Loeve Transform

L

Lagrange multiplier

Lagrangian

Latency

List L0

luminance

M

Machine learning

mean average difference

mean opinion score

mean squared error

Merge Mode

metadata

ML Tools

mode decision

motion compensation

motion vector

Motion Vector Prediction

N

Near MV

Nearest MV

New MV

O

Objective Analysis

over the top

P

pel

Per-title encoding

Progressive

PSNR

Q

Quantization

quantization matrix

Quantization Parameter

Quantization Step Size

R

rate control

rate-distortion optimization

reconstructed

reference frames

residual

Residual values

RGB

ringing

Run Level

S

sample adaptive offset

sampling

search area

search range

Skip Mode

Slices

SSIM

Statistical multiplexing

sub-partition

Subjective Analysis

Sum of Absolute Differences

Sum of Absolute Transform Differences

superblock

T

three-step search

Tiles

transfer function

Transforms

Truncated Unary

U

unconstrained minimization

V

variable length coding

VBR

Versatile Video Coding

Video Coding Experts Group

Video Multimethod Assessment Fusion

Video on demand

video quality

Virtual Reality

VMAF

vpxenc

W

Wavefront parallel processing

WebM

Weighted Prediction

Z

Zero MV